Abstract

This paper considers the use of a simple posterior sampling algorithm to balance between exploration
and exploitation when learning to optimize actions such as in multi-armed bandit problems. The algorithm,
also known as Thompson Sampling, offers significant advantages over the popular upper confidence
bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated
relationships among action rewards. We make two theoretical contributions. The first establishes
a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds
developed for UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical
contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized
to many model classes. This bound depends on a new notion we refer to as the margin dimension,
which measures the degree of dependence among action rewards. Compared to UCB algorithm Bayes
risk bounds for specific model classes, our general bound matches the best available for linear models and
is stronger than the best available for generalized linear models. Further, our analysis provides insight
into performance advantages of posterior sampling, which are highlighted through simulation results that
demonstrate performance surpassing recently proposed UCB algorithms.

1 Introduction

We consider an optimization problem faced by an agent who is uncertain about how his actions influence 
performance. The agent selects actions sequentially, and upon each action observes a reward. A reward 
function governs the mean reward of each action. The agent represents his initial beliefs through a prior 
distribution over reward functions. As rewards are observed the agent learns about the reward function, and 
this allows him to improve behavior. Good performance requires adaptively sampling actions in a way that 
strikes an effective balance between exploring poorly understood actions and exploiting previously acquired 
knowledge to attain high rewards. In this paper, we study a simple algorithm for selecting actions and 
provide finite time performance guarantees that apply across a broad class of models. 

The problem we study has attracted a great deal of recent interest and is often referred to as the multi- 
armed bandit (MAB) problem with dependent arms. We refer to the problem as one of learning to optimize 
to emphasize its divergence from the classical MAB literature. In the typical MAB framework, there are a 
finite number of actions that are modeled independently; sampling one action provides no information about 
the rewards that can be gained through selecting other actions. In contrast, we allow for infinite action 
spaces and for general forms of model uncertainty, captured by a prior distribution over a set of possible 
reward functions. Recent papers have addressed this problem in cases where the relationship among action 
rewards takes a known parametric form. For example, [2, 12, 21] study the case where actions are described
by a finite number of features and the reward function is linear in these features. Other authors have studied
cases where the reward function is Lipschitz continuous [17], sampled from a Gaussian process [25], or takes
the form of a generalized [13] or sparse [3] linear model.

part by Award CMMI-0968707 from the National Science Foundation.

Each paper cited above studies an upper confidence bound (UCB) algorithm. Such an algorithm forms
an optimistic estimate of the mean-reward value for each action, taking it to be the highest statistically 
plausible value. It then selects an action that maximizes among these optimistic estimates. Optimism 
encourages selection of poorly-understood actions, which leads to informative observations. As data accu- 
mulates, optimistic estimates are adapted, and this process of exploration and learning converges toward 
optimal behavior. 

We study an alternative algorithm that we refer to as posterior sampling. It is also known as
Thompson sampling and as probability matching. The algorithm randomly selects an action according to the
probability it is optimal. Although posterior sampling was first proposed almost eighty years ago, it has until
recently received little attention in the literature on multi-armed bandits. While its asymptotic convergence
has been established in some generality [20], not much else is known about its theoretical properties in the
case of dependent arms, or even in the case of independent arms with general prior distributions. Our work
provides some of the first theoretical guarantees.

Our interest in posterior sampling is motivated by several potential advantages over UCB algorithms: 
design simplicity, finite time performance, and computational efficiency. To appreciate these advantages, it 
is important to first understand that a critical step of any UCB algorithm is the construction of a plausible 
set of reward functions: a set that contains the true function with high probability conditioned on past 
observations. For any given model, there is a great deal of design flexibility in choosing the geometric 
structure of such confidence sets, and this choice can drive performance and computational tractability. 

Because there is no need to generate confidence sets for posterior sampling, its use greatly simplifies 
the design process. Further, it is never clear whether a given UCB algorithm employs the "best" choice 
of confidence sets. In fact, sets have often been designed at least in part to facilitate theoretical analysis. 
We show through simulations that posterior sampling outperforms various UCB algorithms that have been 
proposed in the literature. Moreover, theoretical results we will present suggest that posterior sampling 
competes well with any UCB algorithm, regardless of the choice of confidence sets. 

Posterior sampling also offers computational advantages. With a UCB algorithm, an action is chosen 
through simultaneously optimizing over reward functions in a confidence set and over actions. In posterior 
sampling, on the other hand, only a single reward function is sampled, and an action is chosen to optimize 
this function. Sampling a single function is often much more efficient than optimizing over reward functions, 
enabling practical implementations in cases where UCB algorithms are computationally onerous. 

In this paper, we make two theoretical contributions. The first establishes a connection between posterior 
sampling and UCB algorithms. In particular, we show that typical analyses that yield Bayes risk bounds for 
UCB algorithms can also be applied to produce equivalent Bayes risk bounds for posterior sampling. This 
suggests that for any class of problems, posterior sampling satisfies bounds that can be established for a UCB 
algorithm that employs the best choice of confidence sets, and supports our belief that posterior sampling 
competes well with any UCB algorithm. We will also discuss how, for specific classes of models, a number of 
regret bounds that apply to specific UCB algorithms translate to Bayes risk bounds for posterior sampling. 

Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly 
and can be specialized to many specific model classes. Our bound depends on a new notion of dimen- 
sion that measures the degree of dependence among actions. We compare our notion of dimension to the 
Vapnik-Chervonenkis dimension and explain why that and other measures of dimension that are used in the 
supervised learning literature do not suffice when it comes to analyzing posterior sampling. 

In addition to theoretical results, we present a set of simulation results. These results reinforce the 
case that posterior sampling outperforms UCB algorithms that have been proposed for multi-armed bandit 
problems with dependent arms. 

The remainder of this paper is organized as follows. The next section discusses related literature. Section 3
then provides a formal problem statement. We describe UCB and posterior sampling algorithms in Section 4.
We then establish in Section 5 a connection between them, which we apply in Section 6 to convert existing
bounds for UCB algorithms to bounds for posterior sampling. Section 7 develops a new notion of dimension
and presents Bayes risk bounds that depend on it. Section 8 presents simulation results. A closing section
makes concluding remarks.
2 Related Literature



Though it was first proposed in 1933, posterior sampling has until recently received relatively little attention.
Interest in the algorithm grew after empirical studies [11, 24] demonstrated performance exceeding
state-of-the-art methods. An asymptotic convergence result was established in [20], but finite time guarantees
remain limited. The development of further performance bounds was raised as an open problem at the 2012
Conference on Learning Theory [19].

Several recent papers have established theoretical results on posterior sampling. One difference between 
these papers and ours is that we focus on a different measure of performance. These papers all study the 
algorithm's regret, which measures its cumulative loss relative to an algorithm that always selects the optimal
action, for any fixed reward function. We study the algorithm's expected regret, where the expectation is 
taken with respect to the prior distribution over reward functions. This quantity is commonly called the 
algorithm's Bayes risk. We find this to be a practically relevant measure of performance and that this choice 
allows for more elegant analysis. Further, as we discuss in Section 3, asymptotic Bayes risk bounds are
essentially asymptotic regret bounds as well. 

Three recent papers [4, 5, 16] provide regret bounds for posterior sampling when applied to MAB problems
with finitely many independent actions and rewards that follow Bernoulli processes. These results demonstrate
that posterior sampling is asymptotically optimal for the class of problems considered. A key feature
of the bounds is their dependence on the difference between the optimal and second-best mean-reward values.
Unfortunately, such bounds tend not to be meaningful when the number of actions is large or infinite.

In this paper, we establish distribution-independent bounds. When the action space $\mathcal{A}$ is finite, we
establish a finite time Bayes risk bound of order $\sqrt{|\mathcal{A}| T \log T}$. This matches what is implied by the analysis
of [5]. However, our bound does not require independence among actions, and our approach also leads to
meaningful bounds for problems with large or infinite action sets.

Only one other paper has studied posterior sampling in a context involving dependent actions [6]. 
The paper considers a contextual bandit model with a finite number of arms whose mean-reward values
are given by a $d$-dimensional linear model. The cumulative $T$-period regret is shown to be of order
$d\sqrt{(T^{1+\epsilon}/\epsilon) \ln|\mathcal{A}|}\, \ln T \ln\frac{1}{\delta}$ with probability at least $1 - \delta$. Here $\epsilon \in (0, 1)$ is a parameter used by the
algorithm to control how quickly the posterior distribution concentrates. The Bayes risk bounds we will establish
are stronger than those implied by the results of [6]. In particular, we provide a Bayes risk bound of order
$d\sqrt{T} \ln T$ that holds for any compact set of actions. This is optimal up to a factor of $\ln T$ [21].

We are also the first to establish finite time performance bounds for several other problem classes. One 
applies to linear models when the vector of coefficients is likely to be sparse; this bound is stronger than the 
aforementioned one that applies to linear models in the absence of sparsity assumptions. We establish the
first bounds for posterior sampling when applied to generalized linear models and to problems with a
general Gaussian prior. Finally, we establish bounds that apply very broadly and depend on a new notion 
of dimension. 

Unlike most of the relevant literature, we study MAB problems in a general framework, allowing for 
complicated relationships between the rewards generated by different actions. The closest related work is [7], 
which considers the problem of learning the optimum of a function that lies in a known, but otherwise 
arbitrary set of functions. They provide bounds based on a new notion of dimension, but unfortunately this 
notion does not provide a bound for posterior sampling. Other work provides general bounds for contextual 
bandit problems where the context space is allowed to be infinite, but the action space is small (see, e.g., [9]). 
Our model captures contextual bandits as a special case, but we emphasize problem instances with large or 
infinite action sets, and where the goal is to learn without sampling every possible action. 

A focus of our paper is the connection between posterior sampling and UCB approaches. We discuss 
UCB algorithms in some detail in Section 4. UCB algorithms have been the primary approach considered
in the segment of the stochastic MAB literature that treats models with dependent arms. Other approaches
are the knowledge gradient algorithm of [22], forced exploration schemes for linear bandits [1, 21], and
exponential-weighting schemes [9].

There is an immense and rapidly growing literature on bandits with independent arms and on adversarial
bandits. Theoretical work on stochastic bandits with independent arms often focuses on UCB algorithms
[8, 18] or on the Gittins index approach [14]. A review of work on UCB algorithms and on adversarial
bandits can be found in [10]. Work on Gittins indices and related extensions is covered in [15].
3 Problem Formulation



We consider a model involving a set of actions $\mathcal{A}$ and a set of real-valued functions $\mathcal{F} = \{f_\rho : \mathcal{A} \to \mathbb{R} \mid \rho \in \Theta\}$,
indexed by a parameter that takes values from an index set $\Theta$. We will define random variables with respect
to a probability space $(\Omega, \mathbb{F}, \mathbb{P})$. A random variable $\theta$ indexes the true reward function $f_\theta$. At each time $t$,
the agent is presented with a possibly random subset $\mathcal{A}_t \subseteq \mathcal{A}$ and selects an action $A_t \in \mathcal{A}_t$, after which she
observes a reward $R_t$.

We denote by $H_t$ the history $(\mathcal{A}_1, A_1, R_1, \ldots, \mathcal{A}_{t-1}, A_{t-1}, R_{t-1}, \mathcal{A}_t)$ of observations available to the agent
when choosing an action $A_t$. The agent employs a policy $\pi = \{\pi_t \mid t \in \mathbb{N}\}$, which is a deterministic sequence
of functions, each mapping the history $H_t$ to a probability distribution over actions $\mathcal{A}$. For each realization
of $H_t$, $\pi_t(H_t)$ is a distribution over $\mathcal{A}$ with support $\mathcal{A}_t$, though with some abuse of notation, we will often
write this distribution as $\pi_t$. The action $A_t$ is selected by sampling from the distribution $\pi_t$, so that
$\mathbb{P}(A_t \in \cdot \mid \pi_t) = \mathbb{P}(A_t \in \cdot \mid H_t) = \pi_t(\cdot)$. We assume that $\mathbb{E}[R_t \mid H_t, \theta, A_t] = f_\theta(A_t)$. In other words, the realized
reward is the mean-reward value corrupted by zero-mean noise. We will also assume that for each $f \in \mathcal{F}$ and
$t \in \mathbb{N}$, $\arg\max_{a \in \mathcal{A}_t} f(a)$ is nonempty with probability one, though algorithms and results can be generalized
to handle cases where this assumption does not hold.

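The $T$-period expected regret of a policy $\pi$ is the random variable defined by

$$\mathrm{Regret}(T, \pi, \theta) = \sum_{t=1}^{T} \mathbb{E}\left[\max_{a \in \mathcal{A}_t} f_\theta(a) - f_\theta(A_t) \,\Big|\, \theta\right].$$

The $T$-period Bayes risk is defined by $\mathbb{E}[\mathrm{Regret}(T, \pi, \theta)]$, where the expectation is taken with respect to the
prior distribution over $\theta$. Hence,

$$\mathrm{BayesRisk}(T, \pi) = \sum_{t=1}^{T} \mathbb{E}\left[\max_{a \in \mathcal{A}_t} f_\theta(a) - f_\theta(A_t)\right].$$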
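As a rough illustration of these definitions, the following Python sketch estimates the $T$-period Bayes risk of a policy by Monte Carlo for a finite action set with $\mathcal{A}_t = \mathcal{A}$. The standard normal prior, the unit-variance Gaussian noise, and the names `estimate_bayes_risk` and `uniform_policy` are assumptions made only for this example; they are not part of the formulation above.

```python
import numpy as np

def estimate_bayes_risk(policy, n_actions, horizon, n_runs, rng):
    """Monte Carlo estimate of BayesRisk(T, pi) for a finite action set.

    Illustrative assumptions: theta ~ N(0, I), f_theta(a) = theta[a],
    and each reward is f_theta(a) plus standard normal noise.
    """
    total = 0.0
    for _ in range(n_runs):
        theta = rng.standard_normal(n_actions)   # draw a reward function from the prior
        best = theta.max()                       # max_a f_theta(a)
        history = []                             # list of (action, reward) pairs
        for _ in range(horizon):
            a = policy(history, n_actions, rng)
            r = theta[a] + rng.standard_normal()
            history.append((a, r))
            total += best - theta[a]             # instantaneous regret
    return total / n_runs

def uniform_policy(history, n_actions, rng):
    """Baseline policy that ignores the history and picks uniformly at random."""
    return int(rng.integers(n_actions))

rng = np.random.default_rng(0)
risk = estimate_bayes_risk(uniform_policy, n_actions=5, horizon=50, n_runs=200, rng=rng)
```

Averaging over draws of `theta` is what turns the regret for a fixed reward function into the Bayes risk; any policy with the same call signature can be substituted for `uniform_policy`.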



Remark 1. Measurability assumptions are required for the above expectations to be well-defined. In order to 
avoid technicalities that do not present fundamental obstacles in the contexts we consider, we will not explic- 
itly address measurability issues in this paper and instead simply assume that functions under consideration 
satisfy conditions that ensure relevant expectations are well-defined. 

Remark 2. Asymptotic bounds on Bayes risk are essentially asymptotic bounds on regret. In particular, if
$\mathrm{BayesRisk}(T, \pi) = O(g(T))$ for some non-negative function $g$, then $\mathrm{Regret}(T, \pi, \theta) = O_P(g(T))$.

Remark 3. Our stochastic model of action sets $\mathcal{A}_t$ is distinctive relative to most of the multi-armed bandit
literature, which assumes that $\mathcal{A}_t = \mathcal{A}$. This construct allows our formulation to address a variety of
practical issues that are usually viewed as beyond the scope of standard multi-armed bandit formulations. 
Let us provide three examples. 

Example 1. Contextual Models. The contextual multi-armed bandit model is a special case of the formulation
presented above. In such a model, an exogenous Markov process $X_t$ taking values in a set $\mathcal{X}$ influences
rewards. In particular, the expected reward at time $t$ is given by $f_\theta(a, X_t)$. However, this is mathematically
equivalent to a problem with stochastic time-varying decision sets $\mathcal{A}_t$. In particular, one can define the set of
actions to be the set of state-action pairs $\mathcal{A} := \{(x, a) : x \in \mathcal{X}, a \in A(x)\}$, and the set of available actions
to be $\mathcal{A}_t = \{(X_t, a) : a \in A(X_t)\}$.

Example 2. Cautious Actions. In some applications, one may want to explore without risking terrible 
performance. This can be accomplished by restricting the set $\mathcal{A}_t$ to conservative actions. Then, the instantaneous
regret in our framework is the gap between the reward from the chosen action and the reward from
the best conservative action. In many settings, the Bayes risk bounds we will establish for posterior sampling 
imply that the algorithm either attains near-optimal performance or converges to a point where any better 
decision is unacceptably risky. 

A number of formulations of this flavor are amenable to efficient implementations of posterior sampling.
For example, consider a problem where $\mathcal{A}$ is a polytope or ellipsoid in $\mathbb{R}^d$ and $f_\theta(a) = \langle a, \theta \rangle$. Suppose $\theta$
has a Gaussian prior and that reward noise is Gaussian. Then, the posterior distribution of $\theta$ is Gaussian.
Consider an ellipsoidal confidence set $\mathcal{U}_t = \{u \mid \|u - \mu_t\|_{\Sigma_t} \le \beta\}$, for some scalar constant $\beta > 0$, where
$\mu_t$ and $\Sigma_t$ are the mean and covariance matrix of $\theta$, conditioned on $H_t$. One can attain good worst-case
performance with high probability by solving the robust optimization problem $V_{\text{robust}} = \max_{a \in \mathcal{A}} \min_{u \in \mathcal{U}_t} \langle a, u \rangle$,
which is a tractable linear saddle-point problem. Letting our cautious set be given by

$$\mathcal{A}_t = \left\{a \in \mathcal{A} \;\Big|\; \min_{u \in \mathcal{U}_t} \langle a, u \rangle \ge V_{\text{robust}} - \alpha\right\}$$

for some scalar constant $\alpha > 0$, we can then select an optimal cautious action given $\theta$ by solving

$$\begin{aligned}
\text{maximize} \quad & \langle a, \theta \rangle \\
\text{subject to} \quad & a \in \mathcal{A}, \\
& \|a\|_{\Sigma_t^{-1}} \le \tfrac{1}{\beta}\left(\langle a, \mu_t \rangle - V_{\text{robust}} + \alpha\right).
\end{aligned}$$

This problem is computationally tractable, which accommodates efficient implementation of posterior sampling.
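
To see where the norm constraint in the cautious-action problem comes from, one can minimize the linear function over the ellipsoid in closed form. The short derivation below is a sketch that reads the confidence set as $\mathcal{U}_t = \{u \mid \|u - \mu_t\|_{\Sigma_t} \le \beta\}$ and assumes the convention $\|x\|_M = \sqrt{x^\top M x}$ for a positive definite matrix $M$, which the text does not state explicitly.

```latex
\begin{align*}
\min_{u \in \mathcal{U}_t} \langle a, u \rangle
  &= \langle a, \mu_t \rangle + \min_{\|v\|_{\Sigma_t} \le \beta} \langle a, v \rangle
   = \langle a, \mu_t \rangle - \beta \, \|a\|_{\Sigma_t^{-1}}, \\
\min_{u \in \mathcal{U}_t} \langle a, u \rangle \ge V_{\text{robust}} - \alpha
  &\iff \|a\|_{\Sigma_t^{-1}} \le \tfrac{1}{\beta}
     \left( \langle a, \mu_t \rangle - V_{\text{robust}} + \alpha \right).
\end{align*}
```

Under these assumptions the cautious set is described by a single second-order cone constraint, which is consistent with the tractability claim in the example.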

Example 3. Adaptive Adversaries. Consider a model in which rewards are influenced by the choices of
an adaptive adversary. At each time period, the adversary selects an action from some set $\mathcal{A}^-$ based on
past observations. The agent observes this action, responds with an action selected from a set $\mathcal{A}^+$, and
receives a reward that depends on the pair of actions $(A_t^+, A_t^-)$. This fits our framework if the action $A_t$ is
taken to be the pair $(A_t^+, A_t^-)$, and the set of actions available to the agent is $\mathcal{A}_t = \{(a, A_t^-) \mid a \in \mathcal{A}^+\}$.






4 Algorithms 

We will establish finite time performance bounds for posterior sampling by leveraging prior results pertaining 
to UCB algorithms and a connection we will develop between the two classes of algorithms. As background 
and to set the stage for our analysis, we discuss the algorithms in this section. 

4.1 UCB Algorithms 

UCB algorithms have received a great deal of attention in the MAB literature. Such an algorithm makes use
of a sequence of upper confidence bounds $U = \{U_t \mid t \in \mathbb{N}\}$, each of which is a function that takes the history
$H_t$ as its argument. For each realization of $H_t$, $U_t(H_t)$ is a function mapping $\mathcal{A}$ to $\mathbb{R}$. With some abuse
of notation, we will often write this function as $U_t$ and its value at $a \in \mathcal{A}$ as $U_t(a)$. The upper confidence
bound $U_t(a)$ represents the greatest value of $f_\theta(a)$ that is statistically plausible given $H_t$. A UCB algorithm
selects an action $A_t \in \arg\max_{a \in \mathcal{A}_t} U_t(a)$ that maximizes the upper confidence bound. We will assume that
the argmax operation breaks ties among optima in a deterministic way. As such, each action is determined
by the history $H_t$, and for the policy $\pi = \{\pi_t \mid t \in \mathbb{N}\}$ followed by a UCB algorithm, each action distribution
$\pi_t$ concentrates all probability on a single action.

As a concrete example, consider Algorithm 1, proposed in [8] to address MAB problems with a finite
number of independent actions. For such problems, $\mathcal{A}_t = \mathcal{A}$, $\theta$ is a vector with one independent component
per action, and the reward function is given by $f_\theta(a) = \theta_a$. The algorithm begins by selecting each action
once. Then, for each subsequent time $t > |\mathcal{A}|$, the algorithm generates point estimates of action rewards,
defines upper confidence bounds based on them, and selects actions accordingly. For each action $a$, the point
estimate $\hat{\theta}_t(a)$ is taken to be the average reward obtained from samples of action $a$ taken prior to time $t$. The
upper confidence bound is produced by adding an "uncertainty bonus" $\beta\sqrt{\log t / N_t(a)}$ to the point estimate,
where $N_t(a)$ is the number of times action $a$ was selected prior to time $t$ and $\beta$ is an algorithm parameter
generally selected based on reward variances. This uncertainty bonus leads to an optimistic assessment of
expected reward when there is uncertainty, and it is this optimism that encourages exploration that reduces
uncertainty. As $N_t(a)$ increases, uncertainty about action $a$ diminishes and so does the uncertainty bonus.
The $\log t$ term ensures that the agent does not permanently rule out any action, which is important as
there is always some chance of obtaining an overly pessimistic estimate by observing an unlikely sequence of
rewards.
Algorithm 1 Independent UCB

1: Initialize: Select each action once
2: Update Statistics: For each $a \in \mathcal{A}$,
   $\hat{\theta}_t(a) \leftarrow$ sample average of observed rewards
   $N_t(a) \leftarrow$ number of times $a$ sampled so far
3: Select Action:
   $A_t \in \arg\max_{a \in \mathcal{A}} \left\{\hat{\theta}_t(a) + \beta\sqrt{\log t / N_t(a)}\right\}$
4: Increment $t$ and Goto Step 2
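
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of [8]: the function name `independent_ucb`, the caller-supplied reward function `pull`, and the Gaussian noise in the usage lines are assumptions introduced for the example.

```python
import numpy as np

def independent_ucb(pull, n_actions, horizon, beta=1.0):
    """Run a UCB rule with index  mean + beta * sqrt(log t / N_t(a)).

    `pull(a)` is a caller-supplied function returning a sampled reward.
    Returns the list of selected actions.
    """
    counts = np.zeros(n_actions)   # N_t(a)
    sums = np.zeros(n_actions)     # running reward totals per action
    actions = []
    for t in range(1, horizon + 1):
        if t <= n_actions:
            a = t - 1                                    # Step 1: select each action once
        else:
            means = sums / counts                        # point estimates
            bonus = beta * np.sqrt(np.log(t) / counts)   # uncertainty bonus
            a = int(np.argmax(means + bonus))            # ties broken by lowest index
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        actions.append(a)
    return actions

# Hypothetical three-armed problem with Gaussian reward noise.
rng = np.random.default_rng(1)
true_means = np.array([0.1, 0.5, 0.9])
chosen = independent_ucb(lambda a: true_means[a] + 0.1 * rng.standard_normal(), 3, 500)
```

Because `np.argmax` always returns the first maximizer, the selected action is a deterministic function of the history, in line with the tie-breaking assumption made above.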



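Algorithm 2 Linear UCB

1: Initialize: Select $d$ linearly independent actions
2: Update Statistics:
   $\hat{\theta}_t \leftarrow$ OLS estimate of $\theta$
   $\Phi_t \leftarrow \sum_{l=1}^{t-1} \phi(A_l)\phi(A_l)^\top$
   $\Theta_t \leftarrow \left\{\rho : \|\rho - \hat{\theta}_t\|_{\Phi_t} \le \beta\sqrt{d \log t}\right\}$
3: Select Action:
   $A_t \in \arg\max_{a \in \mathcal{A}} \left\{\max_{\rho \in \Theta_t} \langle \phi(a), \rho \rangle\right\}$
4: Increment $t$ and Goto Step 2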



Algorithm 2 is a variation of one proposed in [21] to address a "linear bandit" problem, in which the reward
function is linear in a $d$-dimensional vector $\theta$. In particular, there is a known feature mapping $\phi : \mathcal{A} \to \mathbb{R}^d$
such that an action $a$ yields expected reward $f_\theta(a) = \langle \phi(a), \theta \rangle$. Given past observations, the algorithm
constructs a confidence ellipsoid $\Theta_t$ centered around a least squares estimate $\hat{\theta}_t$ and employs the upper
confidence bound $U_t(a) := \max_{\rho \in \Theta_t} \langle \phi(a), \rho \rangle = \langle \phi(a), \hat{\theta}_t \rangle + \beta\sqrt{d \log(t)}\, \|\phi(a)\|_{\Phi_t^{-1}}$. The term $\|\phi(a)\|_{\Phi_t^{-1}}$
captures the amount of previous exploration in the direction $\phi(a)$, and, as with the case of independent arms,
causes the uncertainty bonus $\beta\sqrt{d \log(t)}\, \|\phi(a)\|_{\Phi_t^{-1}}$ to diminish as the number of observations increases.

Bandits with independent actions are a special case of a linear bandit problem where the feature vectors
of different actions are orthogonal. Unfortunately, the two algorithms are not equivalent in this setting. In
this case $\|\phi(a)\|_{\Phi_t^{-1}} = \sqrt{1/N_t(a)}$, and the uncertainty bonus of Algorithm 2 is inflated by a factor of $\sqrt{d}$
relative to that of Algorithm 1. The dependence on $d$ is required to ensure that the confidence ellipsoid $\Theta_t$
contains the true parameter with high probability, and to our knowledge, there is no theoretical guarantee
available for an algorithm that avoids this dependence.
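
The closed-form linear UCB index and the $\sqrt{d}$ inflation can be checked numerically. The sketch below assumes that $\Phi_t$ is the Gram matrix $\sum_l \phi(A_l)\phi(A_l)^\top$ of previously selected feature vectors and uses one-hot features to mimic independent arms; the function name `linear_ucb_index` and the counts are hypothetical.

```python
import numpy as np

def linear_ucb_index(features, Phi, theta_hat, beta, t):
    """Index <phi(a), theta_hat> + beta * sqrt(d log t) * ||phi(a)||_{Phi^{-1}}.

    features: (n_actions, d) array whose rows are phi(a).
    Phi:      (d, d) positive definite matrix (assumed Gram matrix of past features).
    """
    d = features.shape[1]
    Phi_inv = np.linalg.inv(Phi)
    # ||phi(a)||_{Phi^{-1}} = sqrt(phi(a)^T Phi^{-1} phi(a)), computed row by row
    widths = np.sqrt(np.einsum("ij,jk,ik->i", features, Phi_inv, features))
    return features @ theta_hat + beta * np.sqrt(d * np.log(t)) * widths

# Orthogonal (one-hot) features recover the independent-arm setting.
counts = np.array([4.0, 9.0, 16.0])          # hypothetical N_t(a)
features = np.eye(3)
Phi = np.diag(counts)                        # Gram matrix of past one-hot features
theta_hat = np.zeros(3)                      # zero estimate isolates the bonus term
t, beta = 30, 1.0
linear_bonus = linear_ucb_index(features, Phi, theta_hat, beta, t)
independent_bonus = beta * np.sqrt(np.log(t) / counts)
ratio = linear_bonus / independent_bonus     # every entry equals sqrt(d)
```

With the point estimate set to zero, the index reduces to the bonus alone, so the ratio isolates the extra $\sqrt{d}$ factor discussed above.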

4.2 Posterior Sampling 

The posterior sampling algorithm simply samples each action according to the probability it is optimal. In
particular, the algorithm applies action sampling distributions $\pi_t = \mathbb{P}(A_t^* \in \cdot \mid H_t)$, where $A_t^*$ is a random
variable that satisfies $A_t^* \in \arg\max_{a \in \mathcal{A}_t} f_\theta(a)$. Practical implementations typically operate by at each
time $t$ sampling an index $\hat{\theta}_t \in \Theta$ from the distribution $\mathbb{P}(\theta \in \cdot \mid H_t)$ and then generating an action
$A_t \in \arg\max_{a \in \mathcal{A}_t} f_{\hat{\theta}_t}(a)$. To illustrate, let us provide concrete examples that address problems analogous to
Algorithms 1 and 2.

Our first example involves a model with independent arms. In particular, suppose $\theta$ is drawn from a
normal distribution $N(\mu_0, \Sigma_0)$ with a diagonal covariance matrix $\Sigma_0$, the reward function is given by
$f_\theta(a) = \theta_a$, and the reward noise $R_t - f_\theta(A_t)$ is normally distributed and independent from $(H_t, A_t, \theta)$. Then, it is easy
to show that, conditioned on the history $H_t$, $\theta$ remains normally distributed with independent components.
Algorithm 3 presents an implementation of posterior sampling for this problem. The expectations are easy
to compute and can also be computed recursively.
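
A minimal Python sketch of this independent-arm case follows. It assumes a known noise variance and uses the standard conjugate Gaussian update for the sampled arm; the function name `independent_posterior_sampling`, the `pull` callback, and the numbers in the usage lines are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def independent_posterior_sampling(pull, mu0, var0, noise_var, horizon, rng):
    """Posterior sampling with independent Gaussian arms (conjugate updates)."""
    mu = np.array(mu0, dtype=float)     # posterior means, one per action
    var = np.array(var0, dtype=float)   # posterior variances (diagonal covariance)
    actions = []
    for _ in range(horizon):
        theta_hat = rng.normal(mu, np.sqrt(var))   # sample a model from the posterior
        a = int(np.argmax(theta_hat))              # act optimally for the sampled model
        r = pull(a)
        # Recursive Gaussian update; only the sampled action's posterior changes.
        precision = 1.0 / var[a] + 1.0 / noise_var
        mu[a] = (mu[a] / var[a] + r / noise_var) / precision
        var[a] = 1.0 / precision
        actions.append(a)
    return actions, mu, var

# Hypothetical three-armed problem with unit-variance Gaussian noise.
rng = np.random.default_rng(2)
true_theta = np.array([0.0, 0.3, 1.0])
acts, mu, var = independent_posterior_sampling(
    lambda a: true_theta[a] + rng.standard_normal(),
    mu0=np.zeros(3), var0=np.ones(3), noise_var=1.0, horizon=500, rng=rng)
```

No confidence set appears anywhere in the loop: exploration comes entirely from the randomness of `theta_hat`.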

Our second example treats a linear bandit problem. Here we assume $\theta$ is drawn from a normal distribution
$N(\mu_0, \Sigma_0)$ but without assuming that the covariance matrix is diagonal. We consider a linear reward function
$f_\theta(a) = \langle \phi(a), \theta \rangle$ and assume the reward noise $R_t - f_\theta(A_t)$ is normally distributed and independent from
$(H_t, A_t, \theta)$. As before, it is easy to show that, conditioned on the history $H_t$, $\theta$ remains normally distributed.
Algorithm 4 presents an implementation of posterior sampling for this problem. The expectations can be
computed efficiently via Kalman filtering.
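Algorithm 3 Independent Posterior Sampling

1: Sample Model:
   $\hat{\theta}_t \sim N(\mu_{t-1}, \Sigma_{t-1})$
2: Select Action:
   $A_t \in \arg\max_{a \in \mathcal{A}} \hat{\theta}_t(a)$
3: Update Statistics: For each $a$,
   $\mu_{ta} \leftarrow \mathbb{E}[\theta_a \mid H_t]$
   $\Sigma_{taa} \leftarrow \mathbb{E}[(\theta_a - \mu_{ta})^2 \mid H_t]$
4: Increment $t$ and Goto Step 1

Algorithm 4 Linear Posterior Sampling

1: Sample Model:
   $\hat{\theta}_t \sim N(\mu_{t-1}, \Sigma_{t-1})$
2: Select Action:
   $A_t \in \arg\max_{a \in \mathcal{A}} \langle \phi(a), \hat{\theta}_t \rangle$
3: Update Statistics:
   $\mu_t \leftarrow \mathbb{E}[\theta \mid H_t]$
   $\Sigma_t \leftarrow \mathbb{E}[(\theta - \mu_t)(\theta - \mu_t)^\top \mid H_t]$
4: Increment $t$ and Goto Step 1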
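
The linear case can be sketched in the same way. The code below is an illustration under stated assumptions, not the authors' implementation: it takes a finite list of feature vectors, a known noise variance, and performs the rank-one Kalman-style update of the Gaussian posterior after each observation. The name `linear_posterior_sampling` and the random features in the usage lines are hypothetical.

```python
import numpy as np

def linear_posterior_sampling(pull, features, mu0, Sigma0, noise_var, horizon, rng):
    """Posterior sampling for f_theta(a) = <phi(a), theta> with a Gaussian prior."""
    mu = np.array(mu0, dtype=float)
    Sigma = np.array(Sigma0, dtype=float)
    actions = []
    for _ in range(horizon):
        theta_hat = rng.multivariate_normal(mu, Sigma)   # Step 1: sample a model
        a = int(np.argmax(features @ theta_hat))         # Step 2: maximize the sampled function
        r = pull(a)
        # Step 3: Kalman-style update of the posterior mean and covariance.
        phi = features[a]
        gain = Sigma @ phi / (phi @ Sigma @ phi + noise_var)
        mu = mu + gain * (r - phi @ mu)
        Sigma = Sigma - np.outer(gain, phi @ Sigma)
        Sigma = (Sigma + Sigma.T) / 2.0                  # remove rounding asymmetry
        actions.append(a)
    return actions, mu, Sigma

# Hypothetical problem: 20 actions with random 3-dimensional features.
rng = np.random.default_rng(3)
features = rng.standard_normal((20, 3))
true_theta = np.array([1.0, -0.5, 0.2])
acts, mu, Sigma = linear_posterior_sampling(
    lambda a: features[a] @ true_theta + rng.standard_normal(),
    features, np.zeros(3), np.eye(3), 1.0, 300, rng)
```

Action selection here is a single maximization of a linear function over the action set, in contrast with the joint maximization over parameters and actions required by a linear UCB rule.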



For many problems, posterior sampling offers critical computational advantages over UCB algorithms. 
For example, if the action set $\mathcal{A}$ is a polytope encoded in terms of linear inequalities, Algorithm 2 becomes
impractical because, as observed by [12], the action selection step entails solving a problem equivalent to
linearly constrained negative definite quadratic optimization, which is NP-hard [23]. On the other hand,
Algorithm 4, which is a posterior sampling counterpart to Algorithm 2, admits efficient implementation.
In particular, each action can be selected by solving a linear program. More broadly, for many problems, 
sampling a reward function and then maximizing over actions is easier than simultaneously maximizing over 
reward functions and actions. It is also worth mentioning that, since Markov chain Monte Carlo methods 
can often be used to generate samples from posterior distributions, posterior sampling is a natural choice 
for problems in which the distribution over reward functions is complex. Designing efficient UCB algorithms 
that perform well with such problems is likely to be a far greater challenge. 

5 Confidence Bounds and Risk Decompositions 

Unlike UCB algorithms, posterior sampling does not make use of upper confidence bounds to encourage 
exploration and instead relies on randomization. As such, the two classes of algorithm seem very different. 
However, we will establish in this section a connection that will enable us in Section 6 to derive performance
bounds for posterior sampling from those that apply to UCB algorithms. Since UCB algorithms have 
received much more attention, this leads to a number of new results about posterior sampling. Further, the 
relationship yields insight into the performance advantages of posterior sampling. 

5.1 UCB Risk Decomposition 

Consider a UCB algorithm with an upper confidence bound sequence $U = \{U_t \mid t \in \mathbb{N}\}$. Recall that
$A_t \in \arg\max_{a \in \mathcal{A}_t} U_t(a)$ and $A_t^* \in \arg\max_{a \in \mathcal{A}_t} f_\theta(a)$. We have the following simple regret decomposition:

$$\begin{aligned}
f_\theta(A_t^*) - f_\theta(A_t) &= f_\theta(A_t^*) - U_t(A_t) + U_t(A_t) - f_\theta(A_t) \\
&\le [f_\theta(A_t^*) - U_t(A_t^*)] + [U_t(A_t) - f_\theta(A_t)].
\end{aligned} \tag{1}$$

The inequality follows from the fact that $A_t$ is chosen to maximize $U_t$. If the upper confidence
bound is an upper bound with high probability, as one would expect from a UCB algorithm, then the first
term is negative with high probability. The second term, $U_t(A_t) - f_\theta(A_t)$, penalizes for the width of the
confidence interval. As actions are sampled, $U_t$ should diminish and converge on $f_\theta$. As such, both terms of
the decomposition should eventually vanish. An important feature of this decomposition is that, so long as
the first term is negative, it bounds regret in terms of uncertainty about the current action $A_t$.

Taking the expectation of (1) establishes that the $T$-period Bayes risk of a UCB algorithm satisfies

$$\mathrm{BayesRisk}(T, \pi^U) \le \mathbb{E}\sum_{t=1}^{T} [U_t(A_t) - f_\theta(A_t)] + \mathbb{E}\sum_{t=1}^{T} [f_\theta(A_t^*) - U_t(A_t^*)], \tag{2}$$

where $\pi^U$ is the policy derived from $U$.
5.2 Posterior Sampling Risk Decomposition



As established by the following proposition, the Bayes risk of posterior sampling decomposes in a way
analogous to what we have shown for UCB algorithms. Recall that, with some abuse of notation, for an
upper confidence bound sequence $\{U_t \mid t \in \mathbb{N}\}$ we denote by $U_t(a)$ the random variable $U_t(H_t)(a)$. Let $\pi^{PS}$
denote the policy followed by posterior sampling.

Proposition 1. For any upper confidence bound sequence $\{U_t \mid t \in \mathbb{N}\}$,

$$\mathrm{BayesRisk}(T, \pi^{PS}) = \mathbb{E}\sum_{t=1}^{T} [U_t(A_t) - f_\theta(A_t)] + \mathbb{E}\sum_{t=1}^{T} [f_\theta(A_t^*) - U_t(A_t^*)], \tag{3}$$

for all $T \in \mathbb{N}$.

Proof. Note that, conditioned on $H_t$, the optimal action $A_t^*$ and the action $A_t$ selected by posterior sampling
are identically distributed, and $U_t$ is deterministic. Hence, $\mathbb{E}[U_t(A_t^*) \mid H_t] = \mathbb{E}[U_t(A_t) \mid H_t]$. Therefore

$$\begin{aligned}
\mathbb{E}[f_\theta(A_t^*) - f_\theta(A_t)] &= \mathbb{E}\left[\mathbb{E}[f_\theta(A_t^*) - f_\theta(A_t) \mid H_t]\right] \\
&= \mathbb{E}\left[\mathbb{E}[U_t(A_t) - U_t(A_t^*) + f_\theta(A_t^*) - f_\theta(A_t) \mid H_t]\right] \\
&= \mathbb{E}\left[\mathbb{E}[U_t(A_t) - f_\theta(A_t) \mid H_t] + \mathbb{E}[f_\theta(A_t^*) - U_t(A_t^*) \mid H_t]\right] \\
&= \mathbb{E}[U_t(A_t) - f_\theta(A_t)] + \mathbb{E}[f_\theta(A_t^*) - U_t(A_t^*)].
\end{aligned}$$

Summing over $t$ gives the result. □

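To compare (2) and (3), consider the case where $f_\theta$ takes values in $[0, C]$. Then,

$$\mathrm{BayesRisk}(T, \pi^U) \le \mathbb{E}\sum_{t=1}^{T} [U_t(A_t) - f_\theta(A_t)] + C\sum_{t=1}^{T} \mathbb{P}(f_\theta(A_t^*) > U_t(A_t^*))$$

and

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le \mathbb{E}\sum_{t=1}^{T} [U_t(A_t) - f_\theta(A_t)] + C\sum_{t=1}^{T} \mathbb{P}(f_\theta(A_t^*) > U_t(A_t^*)).$$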
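
The probability term in each of these bounds follows from a pointwise inequality. A brief sketch is given below, under the added assumption (not stated above) that the upper confidence bounds are nonnegative, which can be arranged by clipping $U_t$ to $[0, C]$ when $f_\theta$ takes values in that range.

```latex
\begin{align*}
f_\theta(A_t^*) - U_t(A_t^*)
  &\le \bigl( f_\theta(A_t^*) - U_t(A_t^*) \bigr)\,
       \mathbf{1}\{ f_\theta(A_t^*) > U_t(A_t^*) \} \\
  &\le C \, \mathbf{1}\{ f_\theta(A_t^*) > U_t(A_t^*) \}
\end{align*}
```

Taking expectations gives $\mathbb{E}[f_\theta(A_t^*) - U_t(A_t^*)] \le C\, \mathbb{P}(f_\theta(A_t^*) > U_t(A_t^*))$, and summing over $t$ yields the second term of each bound.
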
An important difference to take note of is that the Bayes risk bound of $\pi^U$ depends on the specific upper
confidence bound sequence $U$ used by the UCB algorithm in question whereas the bound of $\pi^{PS}$ applies
simultaneously for all upper confidence bound sequences. This suggests that, while the Bayes risk of a UCB
algorithm depends critically on the specific choice of confidence sets, posterior sampling depends on the best
possible choice of confidence sets. This is a crucial advantage when there are complicated dependencies among
actions, as designing and computing with appropriate confidence sets presents significant challenges. In fact,
even in the case of a linear model, the design of confidence sets has required sophisticated tools from the
study of multivariate self-normalized martingale processes [2]. This difficulty is likely the main reason that
posterior sampling significantly outperforms recently proposed UCB algorithms in the simulations presented
in Section 8.

We have shown how upper confidence bounds characterize Bayes risk bounds for posterior sampling. We 
will leverage this concept in the next two sections. Let us emphasize, though, that while our analysis of 
posterior sampling will make use of upper confidence bounds, the actual performance of posterior sampling 
does not depend on upper confidence bounds used in the analysis. 



6 From UCB to Posterior Sampling Risk Bounds 

In this section we present Bayes risk bounds for posterior sampling that can be derived by combining our
risk decomposition (3) with results from prior work on UCB regret bounds. Each UCB regret bound was
established through a common procedure, which entailed specifying lower and upper confidence bounds
$L_t : \mathcal{A} \to \mathbb{R}$ and $U_t : \mathcal{A} \to \mathbb{R}$ so that $L_t(a) \le f_\theta(a) \le U_t(a)$ with high probability for each $t$ and $a$, and then
providing an expression that dominates the sum $\sum_{t=1}^{T} (U_t - L_t)(a_t)$ for all sequences of actions $a_1, \ldots, a_T$. As
we will show, each such analysis together with our risk decomposition (3) leads to a Bayes risk bound for
posterior sampling.

6.1 Finitely Many Actions

We consider in this section a problem with $|\mathcal{A}| < \infty$ actions and rewards satisfying $R_t \in [0, 1]$ for all $t$ almost
surely. We note, however, that the results we discuss can be extended to cases where $R_t$ is not bounded but
where instead its distribution is "light-tailed." It is also worth noting that we make no further assumptions
on the class of reward functions $\mathcal{F}$ or on the prior distribution over $\theta$.

In this setting, Algorithm 1, which was proposed by [8], is known to satisfy a problem-independent regret
bound of order $\sqrt{|\mathcal{A}| T \log T}$. Under an additional assumption that action rewards are independent and take
values in $\{0, 1\}$, an order $\sqrt{|\mathcal{A}| T \log T}$ regret bound for posterior sampling is also available [Hj.

Here we provide a Bayes risk bound that is also of order $\sqrt{|\mathcal{A}| T \log T}$ but does not require that action
rewards are independent or binary. Our analysis, like that of [8], makes use of confidence sets that are
Cartesian products of action-specific confidence intervals. The risk decomposition (3) lets us use such
confidence sets to produce bounds for posterior sampling even when the algorithm itself may exploit dependencies
among actions.

Proposition 2. If $|\mathcal{A}| < \infty$ and $R_t \in [0, 1]$ for all $t$ almost surely then

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le 4 + \min\{|\mathcal{A}|, T\} + 4\sqrt{|\mathcal{A}| T \log T}$$

for all $T \in \mathbb{N}$.
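
To get a feel for how this bound scales, the following minimal sketch simply evaluates its right-hand side, reading the garbled display as $4 + \min\{|\mathcal{A}|, T\} + 4\sqrt{|\mathcal{A}| T \log T}$. The function name `finite_action_bound` is ours, and the choice of 10 actions is an arbitrary illustration.

```python
import math

def finite_action_bound(num_actions: int, horizon: int) -> float:
    """Right-hand side of Proposition 2 for |A| = num_actions and T = horizon."""
    return (4 + min(num_actions, horizon)
            + 4 * math.sqrt(num_actions * horizon * math.log(horizon)))

# Per-period bound for 10 actions: it shrinks as the horizon grows,
# reflecting the sqrt(T log T) growth of the cumulative bound.
for T in (10**3, 10**4, 10**5, 10**6):
    print(T, finite_action_bound(10, T) / T)
```

Because the cumulative bound grows like $\sqrt{T \log T}$, the per-period value tends to zero as $T$ increases.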

Proof. Let $N_t(a)$ be the number of times an action $a$ has been sampled prior to time $t$, and let $\hat{\mu}_t(a)$ denote
the empirical mean of rewards from playing action $a$. We consider upper and lower confidence bounds



[ 1 otherwise [ otherwise 

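as in [8]. Using (3), and noting that $U_t(A_t) - L_t(A_t) \ge U_t(A_t) - f_\theta(A_t)$ whenever $f_\theta(A_t) \ge L_t(A_t)$, gives

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le \mathbb{E}\sum_{t=1}^{T} \left[U_t(A_t) - L_t(A_t)\right] + \sum_{t=1}^{T} \left[\mathbb{P}\left(f_\theta(A_t^*) > U_t(A_t^*)\right) + \mathbb{P}\left(f_\theta(A_t) < L_t(A_t)\right)\right].$$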

By the Azuma-Hoeffding inequality, V (f e (A* t ) > U t (A* t )) < jj and V (f e (A t ) < L t (A t )) < i so that the 
second sum is less than 4. Let $\mathcal{A}_T = \{a \in \mathcal{A} : N_{T+1}(a) > 0\}$ and note that $|\mathcal{A}_T| \le \min\{|\mathcal{A}|, T\}$. Then,

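$$\sum_{t=1}^{T} U_t(A_t) - L_t(A_t) \le \sum_{a \in \mathcal{A}_T} \left(1 + \sum_{i=1}^{N_{T+1}(a)} 2\sqrt{\frac{\log T}{i}}\right) \le |\mathcal{A}_T| + 4\sqrt{\log T} \sum_{a \in \mathcal{A}} \sqrt{N_{T+1}(a)}.$$

By Jensen's inequality, $\sum_{a \in \mathcal{A}} \sqrt{N_{T+1}(a)} = |\mathcal{A}| \cdot \frac{1}{|\mathcal{A}|} \sum_{a} \sqrt{N_{T+1}(a)} \le |\mathcal{A}| \sqrt{\frac{1}{|\mathcal{A}|} \sum_{a} N_{T+1}(a)} = \sqrt{|\mathcal{A}| T}$. □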



6.2 Linear and Generalized Linear Models 

We now consider function classes that represent linear and generalized linear models. The bound of
Proposition 2 applies so long as the number of actions is finite, but we will establish alternative bounds that depend
on the dimension of the function class rather than the number of actions. Such bounds accommodate
problems with infinite action sets and can be much stronger than the bound of Proposition 2 when there are
many actions.

The Bayes risk bounds we provide in this section derive from regret bounds of the UCB literature. In
Section 7 we will establish a Bayes risk bound that is as strong for the case of linear models and stronger
for the case of generalized linear models. Since the results of Section 7 to a large extent supersede those we
present here, we aim to be brief and avoid formal proofs in this section's discussion of the bounds and how
they follow from results in the literature.

6.2.1 Linear Models

In the "linear bandit" problem studied by [2, 3, 12, 21], reward functions are parameterized by a vector
$\theta \in \Theta \subset \mathbb{R}^d$, and there is a known feature mapping $\phi : \mathcal{A} \to \mathbb{R}^d$ such that $f_\theta(a) = \langle \phi(a), \theta \rangle$. The following
proposition establishes Bayes risk bounds for such problems. The proposition uses the term $\sigma$-sub-Gaussian
to describe any random variable $X$ that satisfies $\mathbb{E}\exp(\lambda X) \le \exp(\lambda^2 \sigma^2 / 2)$ for all $\lambda \in \mathbb{R}$.

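Proposition 3. Fix positive constants $\sigma$, $c_1$, and $c_2$. If $\Theta \subset \mathbb{R}^d$, $f_\theta(a) = \langle \phi(a), \theta \rangle$ for some $\phi : \mathcal{A} \to \mathbb{R}^d$,
$\sup_{\rho \in \Theta} \|\rho\|_2 \le c_1$, and $\sup_{a \in \mathcal{A}} \|\phi(a)\|_2 \le c_2$, and for each $t$, $R_t - f_\theta(A_t)$ conditioned on $(H_t, A_t, \theta)$ is
$\sigma$-sub-Gaussian, then

$$\mathrm{BayesRisk}(T, \pi^{PS}) = O\left(d \log T \sqrt{T}\right),$$

and

$$\mathrm{BayesRisk}(T, \pi^{PS}) = \tilde{O}\left(\mathbb{E}\sqrt{\|\theta\|_0 \, d \, T}\right).$$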

The second bound essentially replaces the dependence on the dimension $d$ with one on $\mathbb{E}\sqrt{\|\theta\|_0 d}$. The
"zero-norm" $\|\theta\|_0$ is the number of nonzero components, which can be much smaller than $d$ when the reward
function admits a sparse representation. Note that $\tilde{O}$ ignores logarithmic factors. Both bounds follow from
our risk decomposition (3) together with the analysis of [2], in the case of the first bound, and the analysis
of [3], in the case of the second bound. We now provide a brief sketch of how these bounds can be derived.

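If $f_\theta$ takes values in $[-C, C]$ then (3) implies

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le \mathbb{E}\sum_{t=1}^{T} \left[U_t(A_t) - L_t(A_t)\right] + 2C \sum_{t=1}^{T} \left[\mathbb{P}\left(f_\theta(A_t^*) > U_t(A_t^*)\right) + \mathbb{P}\left(f_\theta(A_t) < L_t(A_t)\right)\right]. \qquad (4)$$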

The analyses of [2] and [3] each follow two steps that can be used to bound the right hand side of this
equation. In the first step, an ellipsoidal confidence set $\Theta_t := \{\rho \in \mathbb{R}^d : \|\rho - \hat{\theta}_t\|_{V_t} \le \sqrt{\beta_t}\}$ is constructed, where for
some $\lambda \in \mathbb{R}$, $V_t := \sum_{k=1}^{t} \phi(A_k)\phi(A_k)^T + \lambda I$ captures the amount of exploration carried out in each direction
up to time $t$. The upper and lower bounds induced by the ellipsoid, truncated to the range of $f_\theta$, are $U_t(a) := \min\{C, \max_{\rho \in \Theta_t} \rho^T \phi(a)\}$
and $L_t(a) := \max\{-C, \min_{\rho \in \Theta_t} \rho^T \phi(a)\}$. If the sequence of confidence parameters $\beta_1, \ldots, \beta_T$ is selected
so that $\mathbb{P}(\theta \notin \Theta_t \mid H_t) \le 1/T$ then the second term of the risk decomposition is less than $4C$. For these
confidence sets, the second step establishes a bound on $\sum_{t=1}^{T} (U_t - L_t)(a_t)$ that holds for any sequence of
actions. The analyses presented on pages 7-8 of [12] and pages 14-15 of [2] each imply such a bound of
order $\sqrt{d \max_{t \le T} \beta_t \, T \log(T/\lambda)}$. Plugging in closed form expressions for $\beta_t$ provided in these papers leads to
the bounds of Proposition 3.
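
The upper and lower bounds induced by an ellipsoid have a closed form, which the sketch below computes. It is an illustration only: we assume the truncation in the text is meant to keep the bounds inside $[-C, C]$, and the names `ellipsoid_bounds`, `theta_hat`, `V`, and `beta` are ours. The maximum of $\rho^T \phi$ over $\{\rho : \|\rho - \hat{\theta}\|_V \le \sqrt{\beta}\}$ is $\hat{\theta}^T \phi + \sqrt{\beta}\, \|\phi\|_{V^{-1}}$, and the minimum is the mirror image.

```python
import numpy as np

def ellipsoid_bounds(phi, theta_hat, V, beta, C):
    """Bounds on <phi, rho> over {rho : ||rho - theta_hat||_V <= sqrt(beta)},
    clipped to [-C, C] (the clipping convention is our assumption)."""
    v_inv_phi = np.linalg.solve(V, phi)
    radius = np.sqrt(beta * (phi @ v_inv_phi))  # sqrt(beta) * ||phi||_{V^{-1}}
    center = phi @ theta_hat
    return max(-C, center - radius), min(C, center + radius)

# Hypothetical two-dimensional example.
lam = 1.0
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]])  # phi(A_1), phi(A_2), phi(A_3)
V = lam * np.eye(2) + features.T @ features                # V_t = sum phi phi^T + lambda I
theta_hat = np.array([0.4, -0.2])
for phi in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    print(phi, ellipsoid_bounds(phi, theta_hat, V, beta=0.5, C=1.0))
```

The interval is narrower in the direction that has been explored more often, which is the geometric fact that the second step of these analyses exploits.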

6.2.2 Generalized Linear Models 

In a generalized linear model, the reward function takes the form $f_\theta(a) := g(\langle \phi(a), \theta \rangle)$ where the inverse
link function $g$ is strictly increasing and continuously differentiable. The analysis of [13] can be applied to
establish a Bayes risk bound for posterior sampling, but with one caveat. The algorithm considered in [13]
begins by selecting a sequence of actions $a_1, \ldots, a_d$ with linearly independent feature vectors $\phi(a_1), \ldots, \phi(a_d)$.
Until now, we have not assumed that such actions exist or that they are guaranteed to be feasible over the
first $d$ time periods. After this period of designed exploration, the algorithm selects at each time an action
that maximizes an upper confidence bound. What we will establish using the results from [13] is a bound
on a similarly modified version of posterior sampling, in which the first $d$ actions taken are $a_1, \ldots, a_d$, while
subsequent actions are selected by posterior sampling. Note that the posterior distribution employed at time
$d + 1$ is conditioned on observations made over the first $d$ time periods. We denote this modified posterior
sampling algorithm by $\pi^{PS}_{a_1, \ldots, a_d}$. It is worth mentioning here that in Section 7 we present a result with a
stronger bound that applies to the standard version of posterior sampling, which does not include a designed
exploration period.

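Proposition 4. Fix positive constants $c_1$, $c_2$, $C$, and $\lambda$. If $\Theta \subset \mathbb{R}^d$, $f_\theta(a) = g(\langle \phi(a), \theta \rangle)$ for some strictly
increasing continuously differentiable function $g : \mathbb{R} \to \mathbb{R}$, $\sup_{\rho \in \Theta} \|\rho\|_2 \le c_1$, $\sup_{a \in \mathcal{A}} \|\phi(a)\|_2 \le c_2$, $\mathcal{A}_t = \mathcal{A}$
for all $t$, $\sum_{i=1}^{d} \phi(a_i)\phi(a_i)^T \succeq \lambda I$ for some $a_1, \ldots, a_d \in \mathcal{A}$, and $R_t \in [0, C]$ for all $t$, then

$$\mathrm{BayesRisk}(T, \pi^{PS}_{a_1, \ldots, a_d}) = O\left(r d \log^{3/2} T \sqrt{T}\right),$$

where $r = \sup_{\rho, a} g'(\langle \phi(a), \rho \rangle) / \inf_{\rho, a} g'(\langle \phi(a), \rho \rangle)$.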

Like the analyses of [2] and [3], which apply to linear models, the analysis of [13] follows two steps
that together bound both terms of our risk decomposition (4). First, an ellipsoidal confidence set $\Theta_t$
is constructed, centered around a quasi-maximum likelihood estimator. This confidence set is designed
to contain $\theta$ with high probability. Given confidence bounds $U_t(a) := \min\{C, \max_{\rho \in \Theta_t} g(\langle \phi(a), \rho \rangle)\}$ and
$L_t(a) := \max\{0, \min_{\rho \in \Theta_t} g(\langle \phi(a), \rho \rangle)\}$, a worst case bound on $\sum_{t} (U_t - L_t)(a_t)$ is established. The bound is
similar to those established for the linear case, but there is an added dependence on the slope of $g$.



6.3 Gaussian Processes 

In this section we consider the case where the reward function $f_\theta$ is sampled from a Gaussian process. That is,
the stochastic process $(f_\theta(a) : a \in \mathcal{A})$ is such that for any $a_1, \ldots, a_k \in \mathcal{A}$ the collection $f_\theta(a_1), \ldots, f_\theta(a_k)$ follows
a multivariate Gaussian distribution. The paper [25] studies a UCB algorithm designed for such problems
and provides general regret bounds. Again, through the risk decomposition (3), their analysis provides a
Bayes risk bound for posterior sampling.

For simplicity, we focus our discussion on the case where $\mathcal{A}$ is finite, so that $(f_\theta(a) : a \in \mathcal{A})$ follows a
multivariate Gaussian distribution. As shown in [25], the results extend to infinite action sets through a
discretization argument as long as certain smoothness conditions are satisfied.

When confidence bounds hold, a UCB algorithm incurs high regret from sampling an action only when
the confidence bound at that action is loose. In that case, one would expect the algorithm to learn a lot
about $f_\theta$ based on the observed reward. This suggests the algorithm's cumulative regret may be bounded in
an appropriate sense by the total amount it is expected to learn. Leveraging the structure of the Gaussian
distribution, the analysis of [25] formalizes this idea. They bound the regret of their UCB algorithm in
terms of the maximum amount that any algorithm could learn about $f_\theta$. They use an information theoretic
measure of learning: the information gain. This is defined to be the difference between the entropy of the
prior distribution of $(f_\theta(a) : a \in \mathcal{A})$ and the entropy of the posterior. The maximum possible information
gain is denoted $\gamma_T$, where the maximum is taken over all sequences $a_1, \ldots, a_T$.$^1$ Their analysis also supports
the following result on posterior sampling.

Proposition 5. If $\mathcal{A}$ is finite, $(f_\theta(a) : a \in \mathcal{A})$ follows a multivariate Gaussian distribution with marginal
variances bounded by 1, $R_t - f_\theta(A_t)$ is independent of $(H_t, \theta, A_t)$, and $\{R_t - f_\theta(A_t) \mid t \in \mathbb{N}\}$ is an iid sequence
of zero-mean Gaussian random variables with variance $\sigma^2$, then

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le 1 + 2\sqrt{T \gamma_T \ln(1 + \sigma^{-2})^{-1} \ln\left(\frac{(T^2 + 1)|\mathcal{A}|}{\sqrt{2\pi}}\right)}$$

for all $T \in \mathbb{N}$.



The paper [25] also provides bounds on $\gamma_T$ for kernels commonly used in Gaussian process regression,
including the linear kernel, radial basis kernel, and Matern kernel. Combined with the above proposition,
this yields explicit Bayes risk bounds in these cases.

We will briefly comment on their analysis and how it provides a bound for posterior sampling. First,
note that the posterior distribution is Gaussian, which suggests an upper confidence bound of the form
$U_t(a) := \mu_{t-1}(a) + \sqrt{\beta_t}\,\sigma_{t-1}(a)$, where $\mu_{t-1}(a)$ is the posterior mean, $\sigma_{t-1}(a)$ is the posterior standard
deviation of $f_\theta(a)$, and $\beta_t$ is a confidence parameter. We can provide a Bayes risk bound by bounding both
terms of (3). The next lemma bounds the second term. Since the Gaussian distribution is unbounded and
we study Bayes risk instead of regret, the proof is slightly different from the analysis of [25].



1 An important property of the Gaussian distribution is that the information gain does not depend on the observed rewards. 
This is because the posterior covariance of a multivariate Gaussian is a deterministic function of the points that were sampled. 
For this reason, this maximum is well defined. 
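
Lemma 1. If $U_t(a) := \mu_{t-1}(a) + \sqrt{\beta_t}\,\sigma_{t-1}(a)$ and $\beta_t := 2 \ln\left(\frac{(t^2 + 1)|\mathcal{A}|}{\sqrt{2\pi}}\right)$ then

$$\mathbb{E}\sum_{t=1}^{T} \left[f_\theta(A_t^*) - U_t(A_t^*)\right] \le 1$$

for all $T \in \mathbb{N}$.

Proof. First, if $X \sim N(\mu, \sigma^2)$ then if $\mu \le 0$,

$$\mathbb{E}\left[X \mathbf{1}\{X > 0\}\right] = \int_0^\infty \frac{x}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\} dx \le \frac{\sigma}{\sqrt{2\pi}} \exp\left\{-\frac{\mu^2}{2\sigma^2}\right\}.$$

Then since the posterior distribution of $f_\theta(a) - U_t(a)$ is normal with mean $-\sqrt{\beta_t}\,\sigma_{t-1}(a)$ and variance $\sigma^2_{t-1}(a)$,

$$\mathbb{E}\left[\mathbf{1}\{f_\theta(a) - U_t(a) > 0\}\left[f_\theta(a) - U_t(a)\right] \mid H_t\right] \le \frac{\sigma_{t-1}(a)}{\sqrt{2\pi}} \exp\left\{-\frac{\beta_t}{2}\right\} = \frac{\sigma_{t-1}(a)}{(t^2 + 1)|\mathcal{A}|} \le \frac{1}{(t^2 + 1)|\mathcal{A}|}. \qquad (5)$$

The final inequality above follows from the assumption that $\sigma_0(a) \le 1$. The claim follows from (5) since

$$\mathbb{E}\sum_{t=1}^{T} \left[f_\theta(A_t^*) - U_t(A_t^*)\right] \le \sum_{t=1}^{\infty} \sum_{a \in \mathcal{A}} \mathbb{E}\left[\mathbf{1}\{f_\theta(a) - U_t(a) > 0\}\left[f_\theta(a) - U_t(a)\right]\right] \le \sum_{t=1}^{\infty} \frac{1}{t^2 + 1} \le 1.$$

□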
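
The Gaussian calculation at the start of this proof can be checked numerically. The sketch below is an illustration with names of our own choosing (`positive_part_mean`, `tail_bound`); it uses the standard closed form $\mathbb{E}[X \mathbf{1}\{X > 0\}] = \mu \Phi(\mu/\sigma) + \sigma \varphi(\mu/\sigma)$ for $X \sim N(\mu, \sigma^2)$, where $\Phi$ and $\varphi$ are the standard normal distribution and density functions.

```python
import math

def positive_part_mean(mu, sigma):
    """E[X 1{X > 0}] for X ~ N(mu, sigma^2), via mu * Phi(mu/sigma) + sigma * phi(mu/sigma)."""
    z = mu / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return mu * cdf + sigma * pdf

def tail_bound(mu, sigma):
    """The bound sigma / sqrt(2 pi) * exp(-mu^2 / (2 sigma^2)), valid when mu <= 0."""
    return sigma / math.sqrt(2.0 * math.pi) * math.exp(-mu * mu / (2.0 * sigma * sigma))

# With mean -sqrt(beta_t) * s and the beta_t of Lemma 1, the bound collapses to
# s / ((t^2 + 1) |A|), which is the middle expression in (5).
t, num_actions, s = 3, 5, 0.8
beta_t = 2.0 * math.log((t**2 + 1) * num_actions / math.sqrt(2.0 * math.pi))
print(tail_bound(-math.sqrt(beta_t) * s, s), s / ((t**2 + 1) * num_actions))
```

The bound is tight at $\mu = 0$ and becomes loose as $\mu$ decreases, since the term $\mu \Phi(\mu/\sigma)$ that it drops is then strictly negative.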



Now, consider the first term of (3), which is:

$$\mathbb{E}\sum_{t=1}^{T} (U_t - f_\theta)(A_t) = \mathbb{E}\sum_{t=1}^{T} (U_t - \mu_{t-1})(A_t) = \mathbb{E}\sum_{t=1}^{T} \sqrt{\beta_t}\,\sigma_{t-1}(A_t) \le \sqrt{T \beta_T}\,\sqrt{\mathbb{E}\sum_{t=1}^{T} \sigma^2_{t-1}(A_t)}.$$

Here the first equality follows by the tower property of conditional expectation, the second from the definition of $U_t$, and the final step follows
from the Cauchy-Schwarz inequality. Therefore, to establish a Bayes risk bound it is sufficient to provide a
bound on the sum of posterior variances $\sum_{t=1}^{T} \sigma^2_{t-1}(a_t)$ that holds for any $a_1, \ldots, a_T$. Under the assumption
that $\sigma_0(a) \le 1$, the proof of Lemma 5.4 of [25] shows that $\sigma^2_{t-1}(a_t) \le \alpha^{-1} \log\left(1 + \sigma^{-2} \sigma^2_{t-1}(a_t)\right)$, where
$\alpha = \log(1 + \sigma^{-2})$. At the same time, Lemma 5.3 of [25] shows the information gain from selecting $a_1, \ldots, a_T$ is
equal to $\frac{1}{2}\sum_{t=1}^{T} \log\left(1 + \sigma^{-2} \sigma^2_{t-1}(a_t)\right)$. This shows that for any actions $a_1, \ldots, a_T$ the sum of posterior
variances $\sum_{t=1}^{T} \sigma^2_{t-1}(a_t)$ can be bounded in terms of the information gain from selecting $a_1, \ldots, a_T$. Therefore
$\mathbb{E}\sum_{t=1}^{T} \sigma^2_{t-1}(A_t)$ can be bounded in terms of the largest possible information gain $\gamma_T$.
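
The two facts borrowed from [25] can be verified numerically on a small finite action set. The sketch below is an illustration under our own assumptions: a hypothetical prior covariance `K` with unit marginal variances, a fixed action sequence, and helper names (`posterior_variances`, `information_gain`) that are not from the paper. The information gain is computed as $\frac{1}{2} \log\det(I + \sigma^{-2} K_A)$, where $K_A$ is the prior covariance of the selected actions.

```python
import numpy as np

def posterior_variances(K, actions, noise_var):
    """Posterior variance sigma_{t-1}^2(a_t) just before each selection a_t."""
    cov = K.copy()
    out = []
    for a in actions:
        out.append(cov[a, a])
        # Rank-one Gaussian conditioning on a noisy observation at action a.
        cov = cov - np.outer(cov[:, a], cov[a, :]) / (cov[a, a] + noise_var)
    return np.array(out)

def information_gain(K, actions, noise_var):
    """0.5 * log det(I + K_A / noise_var) for the selected actions (repeats allowed)."""
    K_A = K[np.ix_(actions, actions)]
    _, logdet = np.linalg.slogdet(np.eye(len(actions)) + K_A / noise_var)
    return 0.5 * logdet

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
S = B @ B.T
K = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))  # unit prior variances
actions, noise_var = [0, 2, 2, 1, 0, 3], 0.25
var = posterior_variances(K, actions, noise_var)
print(0.5 * np.sum(np.log1p(var / noise_var)), information_gain(K, actions, noise_var))
```

The two printed values coincide, which is the chain-rule identity of Lemma 5.3, and each entry of `var` satisfies the inequality quoted from Lemma 5.4 because $x / \log(1 + x)$ is increasing in $x$.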



7 Bounds for General Function Classes 

The previous section treated models in which the relationship among action rewards takes a simple and 
tractable form. Indeed, nearly all of the multi-armed bandit literature focuses on such problems. Posterior 
sampling can be applied to a much broader class of models. As such, more general results that hold beyond 
restrictive cases are of particular interest. In this section, we provide a Bayes risk bound that applies
when the reward function lies in a known, but otherwise arbitrary class of uniformly bounded real-valued
functions $\mathcal{F}$. Our analysis of this abstract framework yields a more general result that applies beyond the
scope of specific problems that have been studied in the literature, and also identifies factors that unify more 
specialized prior results. Further, our more general result when specialized to linear models recovers the 
strongest known Bayes risk bound and in the case of generalized linear models yields a bound stronger than 
that established in prior literature. 

If $\mathcal{F}$ is not appropriately restricted, it is impossible to guarantee any reasonably attractive level of Bayes
risk. For example, in a case where $\mathcal{A} = [0, 1]$, $f_\theta(a) = \mathbf{1}(\theta = a)$, $\mathcal{F} = \{f_\theta \mid \theta \in [0, 1]\}$, and $\theta$ is uniformly
distributed over $[0, 1]$, it is easy to see that the Bayes risk of any algorithm over $T$ periods is $T$, which is no
different from the worst level of performance an agent can experience.

This example highlights the fact that Bayes risk bounds must depend on the function class $\mathcal{F}$. The bound
we develop in this section depends on $\mathcal{F}$ through two measures of complexity. The first is the Kolmogorov
dimension, which measures the growth rate of the covering numbers of $\mathcal{F}$ and is closely related to measures
of complexity that are common in the supervised learning literature. It roughly captures the sensitivity of $\mathcal{F}$
to statistical over-fitting. The second measure is a new notion we introduce, which we refer to as the margin
dimension. This captures how effectively the value of unobserved actions can be inferred from observed
samples. We highlight in Section 7.3 why notions of dimension common to the supervised learning literature
are insufficient for our purposes.

Though the results of this section are very general, they do not apply to the entire range of problems
represented by the formulation we introduced in Section 3. In particular, throughout the scope of this section,
we fix constants $C > 0$ and $\sigma > 0$ and impose two simplifying assumptions. The first concerns boundedness
of reward functions.

Assumption 1. For all $f \in \mathcal{F}$ and $a \in \mathcal{A}$, $f(a) \in [0, C]$.

Our second assumption ensures that observation noise is light-tailed. Recall that we say a random variable
$x$ is $\sigma$-sub-Gaussian if $\mathbb{E}[\exp(\lambda x)] \le \exp(\lambda^2 \sigma^2 / 2)$ almost surely for all $\lambda$.

Assumption 2. For all $t \in \mathbb{N}$, $R_t - f_\theta(A_t)$ conditioned on $(H_t, \theta, A_t)$ is $\sigma$-sub-Gaussian.

It is worth noting that the Bayes risk bounds we provide are distribution independent, in the sense that we
show $\mathrm{BayesRisk}(T, \pi^{PS})$ is bounded by an expression that does not depend on $\mathbb{P}(\theta \in \cdot)$.

Our analysis in some ways parallels those found in the literature on UCB algorithms. In the next section
we provide a method for constructing a set $\mathcal{F}_t \subset \mathcal{F}$ of functions that are statistically plausible at time $t$. Let
$w_{\mathcal{F}}(a) := \sup_{f \in \mathcal{F}} f(a) - \inf_{f \in \mathcal{F}} f(a)$ denote the width of $\mathcal{F}$ at $a$. Based on these confidence sets, and using
the risk decomposition (3), one can bound Bayes risk in terms of $\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t)$. In Section 7.2 we establish
a bound on this sum in terms of the Kolmogorov and margin dimensions of $\mathcal{F}$.



7.1 Confidence Bounds 

The construction of tight confidence sets for specific classes of functions presents technical challenges. Even 
for the relatively simple case of linear bandit problems, significant analysis is required. It is therefore perhaps 
surprising that, as we show in this section, one can construct strong confidence sets for an arbitrary class of 
functions without much additional sophistication. While the focus of our work is on providing a Bayes risk 
bound for posterior sampling, the techniques we introduce for constructing confidence sets may find broader 
use. 

The confidence sets constructed here are centered around least squares estimates $\hat{f}_t^{LS} \in \arg\min_{f \in \mathcal{F}} L_{2,t}(f)$
where $L_{2,t}(f) = \sum_{k=1}^{t-1} (f(A_k) - R_k)^2$ is the cumulative squared prediction error.$^2$ The sets take the form
$\mathcal{F}_t := \{f \in \mathcal{F} : \|f - \hat{f}_t^{LS}\|_{2,E_t} \le \sqrt{\beta_t}\}$ where $\beta_t$ is an appropriately chosen confidence parameter, and the
empirical 2-norm $\|\cdot\|_{2,E_t}$ is defined by $\|g\|^2_{2,E_t} = \sum_{k=1}^{t-1} g^2(A_k)$. Hence $\|f - f_\theta\|^2_{2,E_t}$ measures the cumulative
discrepancy between the previous predictions of $f$ and $f_\theta$.

The following lemma is the key to constructing strong confidence sets $(\mathcal{F}_t : t \in \mathbb{N})$. For an arbitrary
function $f$, it bounds the squared error of $f$ from below in terms of the empirical loss of the true function $f_\theta$
and the aggregate empirical discrepancy $\|f - f_\theta\|^2_{2,E_t}$ between $f$ and $f_\theta$. It establishes that for any function
$f$, with high probability, the random process $(L_{2,t}(f) : t \in \mathbb{N})$ never falls below the process $(L_{2,t}(f_\theta) + \frac{1}{2}\|f - f_\theta\|^2_{2,E_t} : t \in \mathbb{N})$
by more than a fixed constant. A proof of the lemma is provided in the appendix.

Lemma 2. For any $\delta > 0$ and $f : \mathcal{A} \to \mathbb{R}$, with probability at least $1 - \delta$,

$$L_{2,t}(f) \ge L_{2,t}(f_\theta) + \frac{1}{2}\|f - f_\theta\|^2_{2,E_t} - 4\sigma^2 \log(1/\delta)$$

simultaneously for all $t \in \mathbb{N}$.

By Lemma 2, with high probability, $f$ can enjoy lower squared error than $f_\theta$ only if its empirical deviation
$\|f - f_\theta\|^2_{2,E_t}$ from $f_\theta$ is less than $8\sigma^2 \log(1/\delta)$. Through a union bound, this property holds uniformly for
all functions in a finite subset of $\mathcal{F}$. Using this fact and a discretization argument, together with the
observation that $L_{2,t}(\hat{f}_t^{LS}) \le L_{2,t}(f_\theta)$, we can establish the following result, which is proved in the appendix.
Let $N(\mathcal{F}, \alpha, \|\cdot\|_\infty)$ denote the $\alpha$-covering number of $\mathcal{F}$ in the sup-norm $\|\cdot\|_\infty$, and let

$$\beta_t^*(\mathcal{F}, \delta, \alpha) := 8\sigma^2 \log\left(N(\mathcal{F}, \alpha, \|\cdot\|_\infty)/\delta\right) + 2\alpha t\left(8C + \sqrt{8\sigma^2 \ln(4t^2/\delta)}\right). \qquad (6)$$

2 The results can be extended to the case where the infimum of $L_{2,t}(f)$ is unattainable by selecting a function with squared
prediction error sufficiently close to the infimum.
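
Proposition 6. For all $\delta > 0$ and $\alpha > 0$, if

$$\mathcal{F}_t = \left\{f \in \mathcal{F} : \left\|f - \hat{f}_t^{LS}\right\|_{2,E_t} \le \sqrt{\beta_t^*(\mathcal{F}, \delta, \alpha)}\right\}$$

for all $t \in \mathbb{N}$, then

$$\mathbb{P}\left(f_\theta \in \bigcap_{t=1}^{\infty} \mathcal{F}_t\right) \ge 1 - 2\delta.$$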



While the expression (6) defining the confidence parameter is complicated, it can be bounded by simple
expressions in important cases. We provide three examples.

Example 4. Finite function classes: When $\mathcal{F}$ is finite, $\beta_t^*(\mathcal{F}, \delta, 0) = 8\sigma^2 \log(|\mathcal{F}|/\delta)$.
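
For a finite class the whole construction is easy to carry out by enumeration. The following sketch is an illustration under our own assumptions: the class is stored as a hypothetical table `F` with one row per function, rewards are generated without noise so that the demo is deterministic, and the helper names (`beta_star`, `confidence_set`, `width`) are ours. `beta_star` implements expression (6) with the logarithm of the covering number passed in directly.

```python
import numpy as np

def beta_star(log_cover, delta, alpha, t, sigma, C):
    """Confidence parameter (6); log_cover stands for log N(F, alpha, sup-norm)."""
    return (8 * sigma**2 * (log_cover + np.log(1 / delta))
            + 2 * alpha * t * (8 * C + np.sqrt(8 * sigma**2 * np.log(4 * t**2 / delta))))

def confidence_set(F, actions, rewards, beta):
    """Row indices of functions within sqrt(beta) of the least squares fit
    in the empirical 2-norm over the sampled actions."""
    actions = np.asarray(actions, dtype=int)
    preds = F[:, actions]                          # preds[k, s] = f_k(A_s)
    losses = ((preds - rewards) ** 2).sum(axis=1)  # cumulative squared error L_{2,t}(f_k)
    ls = losses.argmin()                           # least squares estimate
    dist_sq = ((preds - preds[ls]) ** 2).sum(axis=1)
    return np.flatnonzero(dist_sq <= beta)

def width(F, members, a):
    """Width of the confidence set at action a."""
    return F[members, a].max() - F[members, a].min()

rng = np.random.default_rng(1)
F = rng.uniform(0.0, 1.0, size=(20, 6))  # 20 hypothetical functions on 6 actions
true_idx = 3
actions = np.array([0, 2, 2, 5, 1])
rewards = F[true_idx, actions]           # noiseless rewards, for a deterministic demo
beta = beta_star(np.log(len(F)), delta=0.05, alpha=0.0, t=len(actions) + 1, sigma=0.5, C=1.0)
members = confidence_set(F, actions, rewards, beta)
print(len(members), [round(width(F, members, a), 3) for a in range(6)])
```

Setting `alpha=0.0` is only meaningful for a finite class, where the class is its own 0-cover; for infinite classes a positive `alpha` and the corresponding covering number would be needed.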

Example 5. Linear Models: Consider the case of a $d$-dimensional linear model $f_\rho(a) := \langle \phi(a), \rho \rangle$. Fix
$\gamma = \sup_{a \in \mathcal{A}} \|\phi(a)\|$ and $s = \sup_{\rho \in \Theta} \|\rho\|$. Hence, for all $\rho_1, \rho_2 \in \Theta$, we have $\|f_{\rho_1} - f_{\rho_2}\|_\infty \le \gamma \|\rho_1 - \rho_2\|$. An
$\alpha$-covering of $\mathcal{F}$ can therefore be attained through an $(\alpha/\gamma)$-covering of $\Theta \subset \mathbb{R}^d$. Such a covering requires
$O((1/\alpha)^d)$ elements, and it follows that $\log N(\mathcal{F}, \alpha, \|\cdot\|_\infty) = O(d \log(1/\alpha))$. If $\alpha$ is chosen to be $1/t^2$, the
second term in (6) tends to zero, and therefore $\beta_t^*(\mathcal{F}, \delta, 1/t^2) = O(d \log(t/\delta))$.

Example 6. Generalized Linear Models: Consider the case of a $d$-dimensional generalized linear model
$f_\theta(a) := g(\langle \phi(a), \theta \rangle)$ where $g$ is an increasing Lipschitz continuous function. Fix $g$, $\gamma = \sup_{a \in \mathcal{A}} \|\phi(a)\|$, and
$s = \sup_{\rho \in \Theta} \|\rho\|$. Then, the previous argument shows $\log N(\mathcal{F}, \alpha, \|\cdot\|_\infty) = O(d \log(1/\alpha))$. Again, choosing
$\alpha = 1/t^2$ yields a confidence parameter $\beta_t^*(\mathcal{F}, \delta, 1/t^2) = O(d \log(t/\delta))$.

The confidence parameter $\beta_t^*(\mathcal{F}, 1/t^2, 1/t^2)$ is closely related to the following concept.

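Definition 1. The Kolmogorov dimension of a function class $\mathcal{F}$ is given by

$$\dim_K(\mathcal{F}) = \limsup_{\alpha \downarrow 0} \frac{\log\left(N(\mathcal{F}, \alpha, \|\cdot\|_\infty)\right)}{\log(1/\alpha)}.$$

In particular, we have the following result.

Proposition 7. For any fixed class of functions $\mathcal{F}$,

$$\beta_t^*\left(\mathcal{F}, 1/t^2, 1/t^2\right) = 16\sigma^2\left(1 + o(1) + \dim_K(\mathcal{F})\right) \log t.$$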

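Proof. By definition

$$\beta_t^*\left(\mathcal{F}, 1/t^2, 1/t^2\right) = 8\sigma^2 \left[\frac{\log\left(N(\mathcal{F}, 1/t^2, \|\cdot\|_\infty)\right)}{\log(t^2)} + 1\right] \log(t^2) + \frac{2}{t}\left[8C + \sqrt{8\sigma^2 \ln(4t^4)}\right].$$

The result follows from the fact that $\limsup_{t \to \infty} \log\left(N(\mathcal{F}, 1/t^2, \|\cdot\|_\infty)\right) / \log(t^2) = \dim_K(\mathcal{F})$. □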



7.2 Bayes Risk Bounds 

In this section we introduce a new notion of complexity, the margin dimension, and then use it to develop
a Bayes risk bound. First, we note that using the risk decomposition (3) and the confidence sets $(\mathcal{F}_t : t \in \mathbb{N})$
constructed in the previous section, we can bound the Bayes risk of posterior sampling in terms of confidence
interval widths $w_{\mathcal{F}}(a) := \sup_{f \in \mathcal{F}} f(a) - \inf_{f \in \mathcal{F}} f(a)$. In particular, the following lemma follows from our
risk decomposition (3).

Lemma 3. For all $T \in \mathbb{N}$, if $\inf_{f_\rho \in \mathcal{F}_t} f_\rho(a) \le f_\theta(a) \le \sup_{f_\rho \in \mathcal{F}_t} f_\rho(a)$ for all $t \in \mathbb{N}$ and $a \in \mathcal{A}$ with probability
at least $1 - 1/T$ then

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le C + \mathbb{E}\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t).$$

We can use the confidence sets constructed in the previous section to guarantee that the conditions of this
lemma hold. In particular, choosing $\delta = 1/2T$ in (6) guarantees that $f_\theta \in \bigcap_{t=1}^{\infty} \mathcal{F}_t$ with probability at least
$1 - 1/T$.

Our remaining task is to provide a worst case bound on the sum $\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t)$. First consider the case of
a linearly parameterized model where $f_\rho(a) := \langle \phi(a), \rho \rangle$ for each $\rho \in \Theta \subset \mathbb{R}^d$. Then, it can be shown that
our confidence set takes the form $\mathcal{F}_t := \{f_\rho : \rho \in \Theta_t\}$ where $\Theta_t \subset \mathbb{R}^d$ is an ellipsoid. When an action $A_t$ is
sampled, the ellipsoid shrinks in the direction $\phi(A_t)$. Here the explicit geometric structure of the confidence
set implies that the width $w_{\mathcal{F}_t}$ shrinks not only at $A_t$ but also at any other action whose feature vector is not
orthogonal to $\phi(A_t)$. Some linear algebra leads to a worst case bound on $\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t)$. For a general class of
functions, the situation is much subtler, and we need to measure the way in which the width at each action
can be reduced by sampling other actions. To do this, we introduce the following notion of dependence.

Definition 2. An action $a \in \mathcal{A}$ is $\epsilon$-dependent on actions $\{a_1, \ldots, a_n\} \subseteq \mathcal{A}$ with respect to $\mathcal{F}$ if any pair
of functions $f, \tilde{f} \in \mathcal{F}$ satisfying $\sqrt{\sum_{i=1}^{n} (f(a_i) - \tilde{f}(a_i))^2} \le \epsilon$ also satisfies $f(a) - \tilde{f}(a) \le \epsilon$. Further, $a$ is
$\epsilon$-independent of $\{a_1, \ldots, a_n\}$ with respect to $\mathcal{F}$ if $a$ is not $\epsilon$-dependent on $\{a_1, \ldots, a_n\}$.

Intuitively, an action $a$ is independent of $\{a_1, \ldots, a_n\}$ if two functions that make similar predictions at
$\{a_1, \ldots, a_n\}$ can nevertheless differ significantly in their predictions at $a$. The above definition measures
the "similarity" of predictions at $\epsilon$-scale, and measures whether two functions make similar predictions
at $\{a_1, \ldots, a_n\}$ based on the cumulative discrepancy $\sqrt{\sum_{i=1}^{n} (f(a_i) - \tilde{f}(a_i))^2}$. This measure of dependence
suggests using the following notion of dimension.

Definition 3. The $\epsilon$-margin dimension $\dim_M(\mathcal{F}, \epsilon)$ is the length $d$ of the longest sequence of elements in $\mathcal{A}$
such that, for some $\epsilon' \ge \epsilon$, every element is $\epsilon'$-independent of its predecessors.
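
For a finite function class over a finite action set, $\epsilon$-dependence can be tested by brute force over pairs of functions. The sketch below is an illustration with helper names of our own (`is_eps_dependent`, `is_independent_chain`); it checks a given sequence at a single scale $\epsilon$ and does not search over sequences or over $\epsilon' \ge \epsilon$, so it does not compute the margin dimension itself.

```python
import itertools
import numpy as np

def is_eps_dependent(F, a, prev, eps):
    """True if action a is eps-dependent on the actions in prev w.r.t. the rows of F."""
    prev = np.asarray(list(prev), dtype=int)
    # Ordered pairs, so both signs of f(a) - g(a) are examined.
    for f, g in itertools.permutations(F, 2):
        close_on_prev = np.sqrt(np.sum((f[prev] - g[prev]) ** 2)) <= eps
        if close_on_prev and f[a] - g[a] > eps:
            return False
    return True

def is_independent_chain(F, seq, eps):
    """True if every element of seq is eps-independent of its predecessors."""
    return all(not is_eps_dependent(F, a, seq[:i], eps) for i, a in enumerate(seq))

# Indicator class on 4 actions: row p is the function a -> 1(p == a).
F_ind = np.eye(4)
print(is_independent_chain(F_ind, [0, 1, 2], eps=0.5))     # True
print(is_independent_chain(F_ind, [0, 1, 2, 3], eps=0.5))  # False
```

In the indicator class, observing zeros at earlier actions says nothing about a new action as long as two or more candidates remain, which is why the chain of length 3 is independent while the last action becomes dependent once every other action has been sampled.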

Recall that a vector space has dimension $d$ if and only if $d$ is the length of the longest sequence of elements
such that each element is linearly independent or, equivalently, 0-independent of its predecessors. Definition
3 replaces the requirement of linear independence with $\epsilon$-independence. This extension is advantageous as it
captures both nonlinear dependence and approximate dependence. The following result uses our new notion
of dimension to bound the number of times the width of the confidence interval for a selected action $A_t$ can
exceed a threshold.

Proposition 8. If $(\beta_t \ge 0 \mid t \in \mathbb{N})$ is a nondecreasing sequence and $\mathcal{F}_t := \{f \in \mathcal{F} : \|f - \hat{f}_t^{LS}\|_{2,E_t} \le \sqrt{\beta_t}\}$
then

$$\sum_{t=1}^{T} \mathbf{1}\left(w_{\mathcal{F}_t}(A_t) > \epsilon\right) \le \left(\frac{4\beta_T}{\epsilon^2} + 1\right) \dim_M(\mathcal{F}, \epsilon)$$

for all $T \in \mathbb{N}$ and $\epsilon > 0$.

Proof. We begin by showing that if $w_{\mathcal{F}_t}(A_t) > \epsilon$ then $A_t$ is $\epsilon$-dependent on fewer than $4\beta_T/\epsilon^2$ disjoint
subsequences of $(A_1, \ldots, A_{t-1})$, for $T > t$. To see this, note that if $w_{\mathcal{F}_t}(A_t) > \epsilon$ there are $\bar{f}, \underline{f} \in \mathcal{F}_t$ such
that $\bar{f}(A_t) - \underline{f}(A_t) > \epsilon$. By definition, since $\bar{f}(A_t) - \underline{f}(A_t) > \epsilon$, if $A_t$ is $\epsilon$-dependent on a subsequence
$(A_{i_1}, \ldots, A_{i_k})$ of $(A_1, \ldots, A_{t-1})$ then $\sum_{j=1}^{k} (\bar{f}(A_{i_j}) - \underline{f}(A_{i_j}))^2 > \epsilon^2$. It follows that, if $A_t$ is $\epsilon$-dependent on $K$
disjoint subsequences of $(A_1, \ldots, A_{t-1})$ then $\|\bar{f} - \underline{f}\|^2_{2,E_t} > K\epsilon^2$. By the triangle inequality, we have

$$\left\|\bar{f} - \underline{f}\right\|_{2,E_t} \le \left\|\bar{f} - \hat{f}_t^{LS}\right\|_{2,E_t} + \left\|\underline{f} - \hat{f}_t^{LS}\right\|_{2,E_t} \le 2\sqrt{\beta_t} \le 2\sqrt{\beta_T},$$

and it follows that $K < 4\beta_T/\epsilon^2$.

Next, we show that in any action sequence $(a_1, \ldots, a_\tau)$, there is some element $a_j$ that is $\epsilon$-dependent on at
least $\tau/d - 1$ disjoint subsequences of $(a_1, \ldots, a_{j-1})$, where $d := \dim_M(\mathcal{F}, \epsilon)$. To show this, for an integer $K$
satisfying $Kd + 1 \le \tau \le Kd + d$, we will construct $K$ disjoint subsequences $B_1, \ldots, B_K$. First let $B_i = (a_i)$
for $i = 1, \ldots, K$. If $a_{K+1}$ is $\epsilon$-dependent on each subsequence $B_1, \ldots, B_K$, our claim is established. Otherwise,
select a subsequence $B_i$ such that $a_{K+1}$ is $\epsilon$-independent and append $a_{K+1}$ to $B_i$. Repeat this process for
elements with indices $j > K + 1$ until $a_j$ is $\epsilon$-dependent on each subsequence or $j = \tau$. In the latter scenario
$\sum_i |B_i| \ge Kd$, and since each element of a subsequence $B_i$ is $\epsilon$-independent of its predecessors, $|B_i| = d$. In
this case, $a_\tau$ must be $\epsilon$-dependent on each subsequence.

Now consider taking $(a_1, \ldots, a_\tau)$ to be the subsequence $(A_{t_1}, \ldots, A_{t_\tau})$ of $(A_1, \ldots, A_T)$ consisting of
elements $A_t$ for which $w_{\mathcal{F}_t}(A_t) > \epsilon$. As we have established, each $A_{t_j}$ is $\epsilon$-dependent on fewer than $4\beta_T/\epsilon^2$
disjoint subsequences of $(A_1, \ldots, A_{t_j - 1})$. It follows that each $a_j$ is $\epsilon$-dependent on fewer than $4\beta_T/\epsilon^2$ disjoint
subsequences of $(a_1, \ldots, a_{j-1})$. Combining this with the fact we have established that there is some $a_j$ that is
$\epsilon$-dependent on at least $\tau/d - 1$ disjoint subsequences of $(a_1, \ldots, a_{j-1})$, we have $\tau/d - 1 \le 4\beta_T/\epsilon^2$. It follows
that $\tau \le (4\beta_T/\epsilon^2 + 1)d$, which is our desired result. □

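Using Proposition 8, one can bound the sum $\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t)$, as established by the following lemma.

Lemma 4. If $(\beta_t \ge 0 \mid t \in \mathbb{N})$ is a nondecreasing sequence and $\mathcal{F}_t := \{f \in \mathcal{F} : \|f - \hat{f}_t^{LS}\|_{2,E_t} \le \sqrt{\beta_t}\}$ then

$$\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t) \le 1 + \dim_M\left(\mathcal{F}, T^{-1}\right) C + 4\sqrt{\dim_M\left(\mathcal{F}, T^{-1}\right) \beta_T T}$$

for all $T \in \mathbb{N}$.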

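Proof. To reduce notation, write $d = \dim_M(\mathcal{F}, T^{-1})$ and $w_t = w_{\mathcal{F}_t}(A_t)$. Reorder the sequence $(w_1, \ldots, w_T) \to (w_{i_1}, \ldots, w_{i_T})$
where $w_{i_1} \ge w_{i_2} \ge \ldots \ge w_{i_T}$. We have

$$\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t) = \sum_{t=1}^{T} w_{i_t} = \sum_{t=1}^{T} w_{i_t} \mathbf{1}\{w_{i_t} \le T^{-1}\} + \sum_{t=1}^{T} w_{i_t} \mathbf{1}\{w_{i_t} > T^{-1}\} \le 1 + \sum_{t=1}^{T} w_{i_t} \mathbf{1}\{w_{i_t} > T^{-1}\}.$$

We know $w_{i_t} \le C$. In addition, $w_{i_t} > \epsilon \iff \sum_{k=1}^{T} \mathbf{1}(w_{\mathcal{F}_k}(A_k) > \epsilon) \ge t$. By Proposition 8, this can
only occur if $t \le \left(\frac{4\beta_T}{\epsilon^2} + 1\right) \dim_M(\mathcal{F}, \epsilon)$. For $\epsilon \ge T^{-1}$, $\dim_M(\mathcal{F}, \epsilon) \le \dim_M(\mathcal{F}, T^{-1}) = d$, since $\dim_M(\mathcal{F}, \epsilon')$
is nonincreasing in $\epsilon'$. Therefore, when $w_{i_t} > \epsilon \ge T^{-1}$, $t \le \left(\frac{4\beta_T}{\epsilon^2} + 1\right) d$, which implies $\epsilon \le \sqrt{\frac{4\beta_T d}{t - d}}$. This
shows that if $w_{i_t} > T^{-1}$, then $w_{i_t} \le \min\left\{C, \sqrt{\frac{4\beta_T d}{t - d}}\right\}$. Therefore,

$$\sum_{t=1}^{T} w_{i_t} \mathbf{1}\{w_{i_t} > T^{-1}\} \le dC + \sum_{t=d+1}^{T} \sqrt{\frac{4 d \beta_T}{t - d}} \le dC + 2\sqrt{d \beta_T} \int_{0}^{T} \frac{1}{\sqrt{t}}\, dt = dC + 4\sqrt{d \beta_T T}.$$

□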
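
The last step of this proof is a sum-versus-integral comparison, which the sketch below checks numerically. It is an illustration only: the function names are ours, and the inputs are arbitrary hypothetical values of $d$, $\beta_T$, $C$, and $T$. `worst_case_width_sum` adds up the per-rank caps derived in the proof, namely $C$ for the first $d$ ranks and $\min\{C, \sqrt{4\beta_T d/(t - d)}\}$ afterwards.

```python
import math

def worst_case_width_sum(d, beta_T, C, T):
    """Sum over ranks t = 1..T of the cap on the t-th largest width."""
    total = 0.0
    for t in range(1, T + 1):
        total += C if t <= d else min(C, math.sqrt(4 * beta_T * d / (t - d)))
    return total

def lemma4_tail_bound(d, beta_T, C, T):
    """The closed form d*C + 4*sqrt(d * beta_T * T) reached at the end of the proof."""
    return d * C + 4 * math.sqrt(d * beta_T * T)

for d, beta_T, C, T in [(3, 2.0, 1.0, 100), (10, 0.5, 1.0, 10_000), (5, 20.0, 2.0, 50)]:
    print(worst_case_width_sum(d, beta_T, C, T), lemma4_tail_bound(d, beta_T, C, T))
```

In each case the first number is no larger than the second, since $\sum_{s=1}^{n} s^{-1/2} \le 2\sqrt{n}$.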



Our next corollary, which follows from Lemma 3, Lemma 4, and Proposition 6, establishes a Bayes risk
bound.

Corollary 1. For all $T \in \mathbb{N}$, $\alpha > 0$ and $\delta \le 1/2T$,

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le 1 + \left[\dim_M\left(\mathcal{F}, T^{-1}\right) + 1\right] C + 4\sqrt{\dim_M\left(\mathcal{F}, T^{-1}\right) \beta_T^*(\mathcal{F}, \delta, \alpha) T}.$$

Using bounds on $\beta_t^*$ provided in the previous section together with Corollary 1 yields Bayes risk bounds
that depend on $\mathcal{F}$ only through the margin dimension and either the cardinality or Kolmogorov dimension.
The following proposition provides such bounds.

Proposition 9. For any fixed class of functions $\mathcal{F}$,

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le 1 + \left[\dim_M\left(\mathcal{F}, T^{-1}\right) + 1\right] C + 16\sigma\sqrt{\dim_M\left(\mathcal{F}, T^{-1}\right)\left(1 + o(1) + \dim_K(\mathcal{F})\right) \log(T)\, T}$$

for all $T \in \mathbb{N}$. Further, if $\mathcal{F}$ is finite then

$$\mathrm{BayesRisk}(T, \pi^{PS}) \le 1 + \left[\dim_M\left(\mathcal{F}, T^{-1}\right) + 1\right] C + 8\sigma\sqrt{2 \dim_M\left(\mathcal{F}, T^{-1}\right) \log(2|\mathcal{F}|T)\, T},$$

for all $T \in \mathbb{N}$.

The next two examples show how the first Bayes risk bound of Proposition 9 specializes to $d$-dimensional
linear and generalized linear models. For each of these examples, a bound on $\dim_M(\mathcal{F}, \epsilon)$ is provided in the
appendix.

Example 7. Linear Models: Consider the case of a $d$-dimensional linear model $f_\rho(a) := \langle \phi(a), \rho \rangle$. Fix $\gamma =
\sup_{a \in \mathcal{A}} \|\phi(a)\|$ and $s = \sup_{\rho \in \Theta} \|\rho\|$. Then, $\dim_M(\mathcal{F}, \epsilon) = O(d \log(1/\epsilon))$ and $\dim_K(\mathcal{F}) = O(d)$. Proposition
9 therefore yields an $O(d\sqrt{T} \log(T))$ Bayes risk bound. This is tight to within a factor of $\log T$ [21], and
matches the best available bound for a linear UCB algorithm.

Example 8. Generalized Linear Models: Consider the case of a $d$-dimensional generalized linear model
$f_\theta(a) := g(\langle \phi(a), \theta \rangle)$ where $g$ is an increasing Lipschitz continuous function. Fix $\gamma = \sup_{a \in \mathcal{A}} \|\phi(a)\|$ and $s =
\sup_{\rho \in \Theta} \|\rho\|$. Then, $\dim_K(\mathcal{F}) = O(d)$ and $\dim_M(\mathcal{F}, \epsilon) = O(r^2 d \log(r/\epsilon))$, where $r = \sup_{\theta, a} g'(\langle \phi(a), \theta \rangle) / \inf_{\theta, a} g'(\langle \phi(a), \theta \rangle)$
bounds the ratio between the maximal and minimal slope of $g$. Proposition 9 yields an $O(r d \sqrt{T} \log(rT))$ Bayes
risk bound. We know of no other guarantee for posterior sampling when applied to generalized linear models.
In fact, to our knowledge, this bound is a slight improvement over the strongest Bayes risk bound available
for any algorithm in this setting. The regret bound of [13] translates to an $O(r d \sqrt{T} \log^{3/2}(T))$ Bayes risk
bound.

One advantage of studying posterior sampling in a general framework is that it allows bounds to be 
obtained for specific classes of models by specializing more general results. This advantage is highlighted by 
the ease of developing a performance guarantee for generalized linear models. The problem is reduced to one 
of bounding the margin dimension, and such a bound follows almost immediately from the analysis of linear 
models. In prior literature, extending results from linear to generalized linear models required significant 
technical developments, as presented in [13].



7.3 Relation to the Vapnik-Chervonenkis Dimension 

To close our section on general bounds, we discuss important differences between our new notion of margin 
dimension and complexity measures used in the analysis of supervised learning problems. We begin with an 
example that illustrates how a class of functions that is learnable in constant time in a supervised learning 
context may require an arbitrarily long duration when learning to optimize. 

Example 9. Consider a finite class of binary-valued functions $\mathcal{F} = \{f_\rho : \mathcal{A} \mapsto \{0, 1\} \mid \rho \in \{1, \ldots, n\}\}$ over
a finite action set $\mathcal{A} = \{1, \ldots, n\}$. Let $f_\rho(a) = \mathbf{1}(\rho = a)$, so that each function is an indicator for an action.
To keep things simple, assume that $R_t = f_\theta(A_t)$, so that there is no noise. If $\theta$ is uniformly distributed over
$\{1, \ldots, n\}$, it is easy to see that the Bayes risk of posterior sampling grows linearly with $n$. For large $n$, until
$\theta$ is discovered, each sampled action is unlikely to reveal much about $\theta$ and learning therefore takes very long.
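The linear growth in Example 9 can be checked with a short simulation. The sketch below is an illustration under the example's assumptions (uniform prior, no noise); the function names are hypothetical. Because the posterior stays uniform over the actions not yet ruled out, posterior sampling reduces to guessing uniformly among the remaining candidates.

```python
import random

def posterior_sampling_regret(n, rng):
    """Cumulative regret until theta is found in the noiseless indicator problem."""
    theta = rng.randrange(n)
    candidates = list(range(n))        # support of the (uniform) posterior
    regret = 0
    while True:
        guess = rng.choice(candidates)  # posterior sample; its optimal action is itself
        if guess == theta:
            return regret               # reward 1 observed, theta is identified
        candidates.remove(guess)        # reward 0 rules out only this one action
        regret += 1                     # each wrong guess costs one unit of regret

def average_regret(n, trials=20000, seed=0):
    rng = random.Random(seed)
    return sum(posterior_sampling_regret(n, rng) for _ in range(trials)) / trials

for n in (10, 100):
    print(n, average_regret(n, trials=5000))
```

The number of wrong guesses is uniform on $\{0, \ldots, n-1\}$, so the average regret should be close to $(n-1)/2$.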

Consider the closely related supervised learning problem in which at each time an action $A_t$ is sampled
uniformly from $\mathcal{A}$ and the mean-reward value $f_\theta(A_t)$ is observed. For large $n$, the time it takes to effectively
learn to predict $f_\theta(A_t)$ given $A_t$ does not depend on $t$. In particular, prediction error converges to $1/n$ in
constant time. Note that predicting 0 at every time already achieves this low level of error.

In the preceding example, the $\epsilon$-margin dimension is $n$ for $\epsilon \in (0, 1)$. On the other hand, the
Vapnik-Chervonenkis (VC) dimension, which characterizes the sample complexity of supervised learning, is 1. To
highlight conceptual differences between the margin dimension and the VC dimension, we will now define VC
dimension in a way analogous to how we defined margin dimension. We begin with a notion of independence.

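Definition 4. An action $a$ is VC-independent of $\tilde{\mathcal{A}} \subseteq \mathcal{A}$ if for any $f, \tilde{f} \in \mathcal{F}$ there exists some $\bar{f} \in \mathcal{F}$ which
agrees with $f$ on $a$ and with $\tilde{f}$ on $\tilde{\mathcal{A}}$; that is, $\bar{f}(a) = f(a)$ and $\bar{f}(\tilde{a}) = \tilde{f}(\tilde{a})$ for all $\tilde{a} \in \tilde{\mathcal{A}}$. Otherwise, $a$ is
VC-dependent on $\tilde{\mathcal{A}}$.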

By this definition, an action $a$ is said to be VC-dependent on $\tilde{\mathcal{A}}$ if knowing the values $f \in \mathcal{F}$ takes on
$\tilde{\mathcal{A}}$ could restrict the set of possible values at $a$. This notion of independence is intimately related to the VC
dimension of a class of functions. In fact, it can be used to define VC dimension.

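Definition 5. The VC dimension of a class of binary-valued functions with domain $\mathcal{A}$ is the largest
cardinality of a set $\tilde{\mathcal{A}} \subseteq \mathcal{A}$ such that every $a \in \tilde{\mathcal{A}}$ is VC-independent of $\tilde{\mathcal{A}} \setminus \{a\}$.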
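Definitions 4 and 5 can be applied by brute force to a small finite class. The following sketch is illustrative only: functions are represented as tuples of values indexed by action, the helper names are hypothetical, and the search is exponential in the number of actions. It evaluates the indicator class of Example 9 with $n = 5$.

```python
from itertools import combinations, product

def is_vc_independent(functions, a, subset):
    """Definition 4: for every pair (f, g), some h agrees with f at a and with g on subset."""
    for f, g in product(functions, repeat=2):
        if not any(h[a] == f[a] and all(h[b] == g[b] for b in subset) for h in functions):
            return False
    return True

def vc_dimension(functions, actions):
    """Definition 5: largest set whose every element is VC-independent of the others."""
    best = 0
    for size in range(1, len(actions) + 1):
        for subset in combinations(actions, size):
            if all(is_vc_independent(functions, a, [b for b in subset if b != a])
                   for a in subset):
                best = size
    return best

n = 5
actions = list(range(n))
# f_rho(a) = 1 if rho == a else 0, stored as a tuple over actions.
indicator_class = [tuple(int(a == rho) for a in actions) for rho in range(n)]
print(vc_dimension(indicator_class, actions))
```

No function in the indicator class takes the value 1 at two different actions, so no pair of actions can be mutually VC-independent.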
In the above example, any two actions are VC-dependent because knowing the label of one action could
completely determine the value of the other action. However, this only happens if the sampled action 
has label 1. If it has label 0, one cannot infer anything about the value of the other action. Instead of 
capturing the fact that one could gain useful information about the reward function through exploration, we 
need a stronger requirement that guarantees one will gain useful information through exploration. Such a 
requirement is captured by the following concept. 

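Definition 6. An action $a$ is strongly-dependent on a set of actions $\tilde{\mathcal{A}} \subseteq \mathcal{A}$ if any two functions $f, \tilde{f} \in \mathcal{F}$
that agree on $\tilde{\mathcal{A}}$ agree on $a$; that is, the set $\{f(a) : f(\tilde{a}) = \tilde{f}(\tilde{a}) \;\forall \tilde{a} \in \tilde{\mathcal{A}}\}$ is a singleton. An action $a$ is
weakly independent of $\tilde{\mathcal{A}}$ if it is not strongly-dependent on $\tilde{\mathcal{A}}$.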
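A direct check of Definition 6 on the indicator class of Example 9 shows how strong dependence differs from VC-dependence. This is a small sketch with hypothetical names, using the same tuple representation of functions as an assumption.

```python
def is_strongly_dependent(functions, a, subset):
    """Definition 6: any two functions that agree on subset also agree at a."""
    for f in functions:
        for g in functions:
            if all(f[b] == g[b] for b in subset) and f[a] != g[a]:
                return False
    return True

n = 5
actions = list(range(n))
# f_rho(a) = 1 if rho == a else 0, stored as a tuple over actions.
indicator_class = [tuple(int(a == rho) for a in actions) for rho in range(n)]

# Observing a single other action does not pin down the value at action 0 ...
print(is_strongly_dependent(indicator_class, 0, [1]))
# ... but observing all other actions does.
print(is_strongly_dependent(indicator_class, 0, [1, 2, 3, 4]))
```

Action 0 is weakly independent of $\{1\}$ because two functions can both equal 0 at action 1 and still differ at action 0; only after every other action has been observed is its value determined.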

According to this definition, $a$ is strongly dependent on $\tilde{\mathcal{A}}$ if knowing the values of $f$ on $\tilde{\mathcal{A}}$ completely
determines the value of $f$ on $a$. While the above definition is conceptually useful, for our purposes it is
important to capture approximate dependence between actions. Our definition of margin dimension achieves
this goal by focusing on the possible difference $f(a) - \tilde{f}(a)$ between two functions that approximately agree
on $\tilde{\mathcal{A}}$.



8 Simulation Results 

In this section, we compare the performance in simulation of posterior sampling to that of UCB algorithms 
that have been proposed in the recent literature. Our results demonstrate that posterior sampling signifi- 
cantly outperforms these algorithms. Moreover, we identify a clear cause for the large discrepancy: confidence 
sets proposed in the literature are too loose to attain good performance. 

We consider the linear model $f_\theta(a) = \langle \phi(a), \theta \rangle$ where $\theta \in \mathbb{R}^{10}$ follows a multivariate Gaussian distribution
with mean vector $\mu = 0$ and covariance matrix $\Sigma = 10I$. The noise terms $\epsilon_t := R_t - f_\theta(A_t)$ follow a standard
Gaussian distribution. There are 100 actions with feature vector components drawn uniformly at random
from $[-1/\sqrt{10}, 1/\sqrt{10}]$, and $\mathcal{A}_t = \mathcal{A}$ for each $t$. Figure 1 shows the portion $\langle \phi(A^*), \theta \rangle - \langle \phi(A_t), \theta \rangle$ of regret
attributable to each time period $t$ in the first 1000 time periods. The results are averaged across 5000 trials.
Several UCB algorithms are suitable for such problems, including those of [2, 21, 25]. While the confidence
bound of [21] is stronger than that of [12], it is still too loose and the resulting linear UCB algorithm
hardly improves its performance over the 1000 period time horizon. We display the results only of the more
competitive UCB algorithms. The line labeled "linear UCB" displays the results of the algorithm proposed
in [2], which incurred average regret of 339.7. The algorithm of [25] is labeled "Gaussian UCB," and incurred
average regret of 198.7. Posterior sampling, on the other hand, incurred average regret of only 97.5.
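For readers who want to reproduce the flavor of this experiment, the following is a minimal sketch of posterior sampling in the linear-Gaussian setting described above. It is not the authors' code: the function name, the use of NumPy, and the precision-matrix form of the conjugate update are assumptions made for illustration, and a single run is shown rather than an average over 5000 trials.

```python
import numpy as np

def run_posterior_sampling(horizon=1000, d=10, num_actions=100, seed=0):
    """One trial of posterior sampling; returns the per-period regret."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(10.0)
    features = rng.uniform(-bound, bound, size=(num_actions, d))
    theta = rng.multivariate_normal(np.zeros(d), 10.0 * np.eye(d))
    values = features @ theta              # true mean reward of each action
    precision = np.eye(d) / 10.0           # inverse of the prior covariance 10 I
    precision_mean = np.zeros(d)           # precision matrix times posterior mean
    regret = np.zeros(horizon)
    for t in range(horizon):
        cov = np.linalg.inv(precision)
        cov = 0.5 * (cov + cov.T)          # symmetrize against round-off
        mean = cov @ precision_mean
        sampled_theta = rng.multivariate_normal(mean, cov)
        action = int(np.argmax(features @ sampled_theta))
        reward = values[action] + rng.standard_normal()   # standard Gaussian noise
        # Conjugate Gaussian update with unit noise variance.
        precision += np.outer(features[action], features[action])
        precision_mean += reward * features[action]
        regret[t] = values.max() - values[action]
    return regret

regret = run_posterior_sampling()
print(regret[:100].sum(), regret[-100:].sum())
```

The algorithm has no tuning parameter: exploration comes entirely from sampling the parameter from the current posterior.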

Each of these UCB algorithms uses a confidence bound that was derived through stochastic analysis.
The Gaussian linear model has a clear structure, however, which suggests upper confidence bounds should
take the form $U_t(a) = \mu_{t-1}(a) + \sqrt{\beta}\,\sigma_{t-1}(a)$ where $\mu_{t-1}(a)$ and $\sigma_{t-1}(a)$ are the posterior mean and standard
deviation at $a$. The final algorithm we consider ignores theoretical considerations, and tunes the parameter
$\beta$ to minimize the average regret over the first 1000 periods. The average regret of the algorithm was only
68.9, a dramatic improvement over [2] and [25], and even outperforming posterior sampling. On the plot
shown below, these results are labeled "Gaussian UCB - Tuned Heuristic." Note such tuning requires the
time horizon to be fixed and known.
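The index $U_t(a) = \mu_{t-1}(a) + \sqrt{\beta}\,\sigma_{t-1}(a)$ is simple to compute from the Gaussian posterior over $\theta$. The sketch below is an illustrative assumption about how one might do so with NumPy; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def gaussian_ucb_action(features, post_mean, post_cov, beta):
    """Pick the action maximizing posterior mean + sqrt(beta) * posterior std."""
    mean_reward = features @ post_mean
    # Posterior variance of <phi(a), theta> is phi(a)^T Sigma phi(a) for each action a.
    var_reward = np.einsum("ij,jk,ik->i", features, post_cov, features)
    index = mean_reward + np.sqrt(beta) * np.sqrt(var_reward)
    return int(np.argmax(index))

# Toy example: action 0 has the higher mean, action 1 the higher uncertainty.
features = np.eye(2)
post_mean = np.array([1.0, 0.0])
post_cov = np.diag([0.01, 4.0])
print(gaussian_ucb_action(features, post_mean, post_cov, beta=0.0))
print(gaussian_ucb_action(features, post_mean, post_cov, beta=1.0))
```

With $\beta = 0$ the rule is purely greedy, and larger $\beta$ shifts the choice toward uncertain actions, which is the single degree of freedom the tuned heuristic optimizes.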

In this setting, the problem of choosing upper-confidence bounds reduces to choosing a single confidence
parameter $\beta$. For more complicated problems, however, significant analysis may be required to choose a
structural form for confidence sets. The results in this section suggest that it can be quite challenging to
use such analysis to derive confidence bounds that lead to strong empirical performance. In particular, this
is challenging even for linear models. For example, the paper [2] uses sophisticated tools from the study
of multivariate self-normalized martingales to derive a confidence bound that is stronger than those of [12]
or [21], but their algorithm still incurs about three and a half times the regret of posterior sampling. This
highlights a crucial advantage of posterior sampling that we have emphasized throughout this paper: it
effectively separates confidence bound analysis from algorithm design.
Figure 1: Portion of regret attributable to each time period.



Finally, it should be noted that the algorithms of [2, 25] have free parameters that must be chosen by the
user. We have attempted to set these values in a way that minimizes average regret over the 1000 period time
horizon. Both algorithms construct confidence bounds that hold with a pre-specified probability $1 - \delta \in [0, 1]$.
Higher levels of $\delta$ lead to lower upper-confidence bounds, which we find improves performance. We set $\delta = 1$
to minimize the average regret of the algorithms. The algorithm of [2] requires two other choices. We used a
line search to set the algorithm's regularization parameter to the level $\lambda = .025$, which minimizes cumulative
regret. The algorithm of [2] also requires a uniform upper bound on $\|\theta\|$, but the Gaussian distribution is
unbounded. We avoid this issue by providing the actual realized value $\|\theta\|$ as an input to the algorithm.

9 Conclusion 

This paper has considered the use of a simple posterior sampling algorithm for learning to optimize actions 
when the decision maker is uncertain about how his actions influence performance. We believe that, par- 
ticularly for difficult problem instances, this algorithm offers significant potential advantages because of its 
design simplicity and computational tractability. Despite its great potential, not much is known about pos- 
terior sampling when there are dependencies between actions. Our work has taken a significant step toward 
remedying this gap. We showed that the Bayes risk of posterior sampling can be decomposed in terms of 
confidence sets, which allowed us to establish a number of new results on posterior sampling by leveraging 
prior work on UCB algorithms. We then used this risk decomposition to analyze posterior sampling in a 
very general framework, and developed Bayes risk bounds that depend on a new notion of dimension. 

In constructing these bounds, we have identified two factors that control the hardness of a particular 
multi-armed bandit problem. First, an agent's ability to quickly attain near-optimal performance depends 
on the extent to which the reward value at one action can be inferred by sampling other actions. However, 
in order to select an action the agent must make inferences about many possible actions, and an error in its 
evaluation of any one could result in large regret. Our second measure of complexity controls for the difficulty 
of maintaining appropriate confidence sets simultaneously at every action. While our bounds are nearly tight 
in some cases, further analysis is likely to yield stronger results in other cases. We hope, however, that our 
work provides a conceptual foundation for the study of such problems, and inspires further investigation. 
A Proof of Confidence Bound



A.1 Preliminaries: Martingale Exponential Inequalities

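Consider random variables $(Z_n \mid n \in \mathbb{N})$ adapted to the filtration $(\mathcal{H}_n : n = 0, 1, \ldots)$. Assume $\mathbb{E}[\exp\{\lambda Z_i\}]$
is finite for all $\lambda$. Define the conditional mean $\mu_i = \mathbb{E}[Z_i \mid \mathcal{H}_{i-1}]$. We define the conditional cumulant
generating function of the centered random variable $[Z_i - \mu_i]$ by $\psi_i(\lambda) = \log \mathbb{E}[\exp(\lambda[Z_i - \mu_i]) \mid \mathcal{H}_{i-1}]$. Let

$$M_n(\lambda) = \exp\left\{\sum_{i=1}^{n} \lambda[Z_i - \mu_i] - \psi_i(\lambda)\right\}.$$

Lemma 5. $(M_n(\lambda) \mid n \in \mathbb{N})$ is a martingale, and $\mathbb{E} M_n(\lambda) = 1$.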

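Proof. By definition $\mathbb{E}[M_1(\lambda) \mid \mathcal{H}_0] = \mathbb{E}[\exp\{\lambda[Z_1 - \mu_1] - \psi_1(\lambda)\} \mid \mathcal{H}_0] = \mathbb{E}[\exp\{\lambda[Z_1 - \mu_1]\} \mid \mathcal{H}_0] / \exp\{\psi_1(\lambda)\} = 1$.
Then, for any $n \ge 2$,

$$\begin{aligned}
\mathbb{E}[M_n(\lambda) \mid \mathcal{H}_{n-1}] &= \mathbb{E}\left[\exp\left\{\sum_{i=1}^{n-1} \lambda[Z_i - \mu_i] - \psi_i(\lambda)\right\} \exp\{\lambda[Z_n - \mu_n] - \psi_n(\lambda)\} \,\Big|\, \mathcal{H}_{n-1}\right] \\
&= \exp\left\{\sum_{i=1}^{n-1} \lambda[Z_i - \mu_i] - \psi_i(\lambda)\right\} \mathbb{E}\left[\exp\{\lambda[Z_n - \mu_n] - \psi_n(\lambda)\} \mid \mathcal{H}_{n-1}\right] \\
&= \exp\left\{\sum_{i=1}^{n-1} \lambda[Z_i - \mu_i] - \psi_i(\lambda)\right\} = M_{n-1}(\lambda).
\end{aligned}$$

□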



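Lemma 6. For all $x \ge 0$ and $\lambda \ge 0$, $\mathbb{P}\left(\sum_{i=1}^{n} \lambda Z_i \le x + \sum_{i=1}^{n}[\lambda\mu_i + \psi_i(\lambda)] \;\; \forall n \in \mathbb{N}\right) \ge 1 - e^{-x}$.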
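Lemma 6 can be sanity-checked by simulation in a simple special case. The sketch below assumes i.i.d. standard Gaussian $Z_i$, for which $\mu_i = 0$ and $\psi_i(\lambda) = \lambda^2/2$, and truncates the time horizon; the function name and parameter values are illustrative assumptions.

```python
import math
import random

def violation_frequency(x, lam, horizon=200, trials=5000, seed=0):
    """Fraction of paths on which lam * S_n > x + n * lam^2 / 2 for some n <= horizon."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        partial_sum = 0.0
        for n in range(1, horizon + 1):
            partial_sum += rng.gauss(0.0, 1.0)
            # For standard Gaussian Z_i: mu_i = 0 and psi_i(lam) = lam^2 / 2.
            if lam * partial_sum > x + n * lam ** 2 / 2.0:
                violations += 1
                break
    return violations / trials

x, lam = 2.0, 1.0
print(violation_frequency(x, lam), math.exp(-x))
```

The empirical violation frequency should not exceed $e^{-x}$, up to Monte Carlo error; the truncated horizon can only make violations rarer.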



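Proof. For any $\lambda$, $M_n(\lambda)$ is a martingale with $\mathbb{E} M_n(\lambda) = 1$. Therefore, for any stopping time $\tau$, $\mathbb{E} M_{\tau \wedge n}(\lambda) =
1$. For arbitrary $x > 0$, define $\tau_x = \inf\{n \ge 0 \mid M_n(\lambda) \ge x\}$ and note that $\tau_x$ is a stopping time corresponding
to the first time $M_n$ crosses the boundary at $x$. Then, $\mathbb{E} M_{\tau_x \wedge n}(\lambda) = 1$ and by Markov's inequality:

$$x\, \mathbb{P}\left(M_{\tau_x \wedge n}(\lambda) \ge x\right) \le \mathbb{E} M_{\tau_x \wedge n}(\lambda) = 1.$$

We note that the event $\{M_{\tau_x \wedge n}(\lambda) \ge x\} = \bigcup_{k=1}^{n} \{M_k(\lambda) \ge x\}$. So we have shown that for all $x > 0$ and
$n \ge 1$,

$$\mathbb{P}\left(\bigcup_{k=1}^{n} \{M_k(\lambda) \ge x\}\right) \le \frac{1}{x}.$$

Taking the limit as $n \to \infty$, and applying the monotone convergence theorem shows $\mathbb{P}\left(\bigcup_{k=1}^{\infty} \{M_k(\lambda) \ge x\}\right) \le
\frac{1}{x}$, or, $\mathbb{P}\left(\bigcup_{k=1}^{\infty} \{M_k(\lambda) \ge e^x\}\right) \le e^{-x}$. This then shows, using the definition of $M_k(\lambda)$, that

$$\mathbb{P}\left(\bigcup_{n=1}^{\infty} \left\{\sum_{i=1}^{n} \lambda[Z_i - \mu_i] - \psi_i(\lambda) \ge x\right\}\right) \le e^{-x}.$$

□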



A.2 Proof of Lemma 2

Lemma 2. For any $\delta > 0$ and $f : \mathcal{A} \mapsto \mathbb{R}$, with probability at least $1 - \delta$,

$$L_{2,t}(f) \ge L_{2,t}(f_\theta) + \frac{1}{2}\|f - f_\theta\|_{2,E_t}^2 - 4\sigma^2 \log(1/\delta)$$

simultaneously for all $t \in \mathbb{N}$.
We will transform our problem in order to apply the general exponential martingale result shown above.
We set $\mathcal{H}_{t-1}$ to be the $\sigma$-algebra generated by $(H_t, A_t, \theta)$. By previous assumptions, $\epsilon_t := R_t - f_\theta(A_t)$
satisfies $\mathbb{E}[\epsilon_t \mid \mathcal{H}_{t-1}] = 0$ and $\mathbb{E}[\exp\{\lambda\epsilon_t\} \mid \mathcal{H}_{t-1}] \le \exp\left\{\frac{\lambda^2\sigma^2}{2}\right\}$ a.s. for all $\lambda$. Define $Z_t = (f_\theta(A_t) - R_t)^2 -
(f(A_t) - R_t)^2$.

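Proof. By definition $\sum_{t=1}^{T} Z_t = L_{2,T+1}(f_\theta) - L_{2,T+1}(f)$. Some calculation shows that $Z_t = -(f(A_t) - f_\theta(A_t))^2 +
2(f(A_t) - f_\theta(A_t))\epsilon_t$. Therefore, the conditional mean and conditional cumulant generating function satisfy:

$$\begin{aligned}
\mu_t &= \mathbb{E}[Z_t \mid \mathcal{H}_{t-1}] = -(f(A_t) - f_\theta(A_t))^2 \\
\psi_t(\lambda) &= \log \mathbb{E}[\exp(\lambda[Z_t - \mu_t]) \mid \mathcal{H}_{t-1}] \\
&= \log \mathbb{E}[\exp(2\lambda(f(A_t) - f_\theta(A_t))\epsilon_t) \mid \mathcal{H}_{t-1}] \le \frac{(2\lambda[f(A_t) - f_\theta(A_t)])^2\sigma^2}{2}.
\end{aligned}$$

Applying Lemma 6 shows that for all $x \ge 0$, $\lambda \ge 0$,

$$\mathbb{P}\left(\sum_{k=1}^{t} \lambda Z_k \le x - \lambda\sum_{k=1}^{t}(f(A_k) - f_\theta(A_k))^2 + \frac{\lambda^2}{2}\sum_{k=1}^{t}(2f(A_k) - 2f_\theta(A_k))^2\sigma^2 \;\; \forall t \in \mathbb{N}\right) \ge 1 - e^{-x}.$$

Or, rearranging terms,

$$\mathbb{P}\left(\sum_{k=1}^{t} Z_k \le \frac{x}{\lambda} + \sum_{k=1}^{t}(f(A_k) - f_\theta(A_k))^2\left(2\lambda\sigma^2 - 1\right) \;\; \forall t \in \mathbb{N}\right) \ge 1 - e^{-x}.$$

Choosing $\lambda = \frac{1}{4\sigma^2}$, $x = \log\frac{1}{\delta}$, and using the definition of $Z_k$ implies

$$\mathbb{P}\left(L_{2,t}(f) \ge L_{2,t}(f_\theta) + \frac{1}{2}\|f - f_\theta\|_{2,E_t}^2 - 4\sigma^2\log(1/\delta) \;\; \forall t \in \mathbb{N}\right) \ge 1 - \delta.$$

□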
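The guarantee of Lemma 2 can be probed numerically in a toy instance. The sketch below is an assumption-laden illustration: a single action is played repeatedly, the noise is Gaussian with standard deviation `sigma`, `gap` plays the role of $f(a) - f_\theta(a)$, and the horizon is truncated. It estimates how often the inequality of Lemma 2 fails at some time.

```python
import math
import random

def lemma2_violation_frequency(delta=0.1, sigma=1.0, gap=0.3,
                               horizon=200, trials=2000, seed=0):
    """Fraction of paths on which the Lemma 2 inequality fails for some t <= horizon."""
    rng = random.Random(seed)
    slack = 4.0 * sigma ** 2 * math.log(1.0 / delta)
    violations = 0
    for _ in range(trials):
        loss_f = loss_theta = sq_dist = 0.0
        for _ in range(horizon):
            noise = rng.gauss(0.0, sigma)     # R - f_theta(A)
            loss_theta += noise ** 2          # (f_theta(A) - R)^2
            loss_f += (gap - noise) ** 2      # (f(A) - R)^2
            sq_dist += gap ** 2               # running value of ||f - f_theta||^2
            if loss_f < loss_theta + 0.5 * sq_dist - slack:
                violations += 1
                break
    return violations / trials

print(lemma2_violation_frequency(), "should be at most", 0.1)
```

The observed failure frequency should stay below $\delta$, consistent with the lemma holding simultaneously over all times.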



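A.3 Least Squares Bound - Proof of Proposition 6

Proposition 6. For all $\delta > 0$ and $\alpha > 0$, if $\mathcal{F}_t = \left\{f \in \mathcal{F} : \left\|f - \hat{f}_t^{LS}\right\|_{2,E_t} \le \sqrt{\beta_t^*(\mathcal{F}, \delta, \alpha)}\right\}$ for all $t \in \mathbb{N}$,
then

$$\mathbb{P}\left(f_\theta \in \bigcap_{t=1}^{\infty} \mathcal{F}_t\right) \ge 1 - 2\delta.$$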

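Proof. Let $\mathcal{F}^\alpha \subset \mathcal{F}$ be an $\alpha$-cover of $\mathcal{F}$ in the sup-norm in the sense that for any $f \in \mathcal{F}$ there is an $f^\alpha \in \mathcal{F}^\alpha$
such that $\|f^\alpha - f\|_\infty \le \alpha$. By a union bound, with probability at least $1 - \delta$,

$$L_{2,t}(f^\alpha) - L_{2,t}(f_\theta) \ge \frac{1}{2}\|f^\alpha - f_\theta\|_{2,E_t}^2 - 4\sigma^2\log(|\mathcal{F}^\alpha|/\delta) \quad \forall t \in \mathbb{N},\ f^\alpha \in \mathcal{F}^\alpha.$$

Therefore, with probability at least $1 - \delta$, for all $t \in \mathbb{N}$ and $f \in \mathcal{F}$:

$$L_{2,t}(f) - L_{2,t}(f_\theta) \ge \frac{1}{2}\|f - f_\theta\|_{2,E_t}^2 - 4\sigma^2\log(|\mathcal{F}^\alpha|/\delta) + \underbrace{\min_{f^\alpha \in \mathcal{F}^\alpha}\left\{\frac{1}{2}\|f^\alpha - f_\theta\|_{2,E_t}^2 - \frac{1}{2}\|f - f_\theta\|_{2,E_t}^2 + L_{2,t}(f) - L_{2,t}(f^\alpha)\right\}}_{\text{Discretization Error}}.$$

Lemma 7, which we establish in the next section, asserts that with probability at least $1 - \delta$ the discretization
error is bounded for all $t$ by $\alpha\eta_t$ where $\eta_t := t\left[8C + \sqrt{8\sigma^2\ln(4t^2/\delta)}\right]$. Since the least squares estimate $\hat{f}_t^{LS}$
has lower squared error than $f_\theta$ by definition, we find with probability at least $1 - 2\delta$

$$\frac{1}{2}\left\|\hat{f}_t^{LS} - f_\theta\right\|_{2,E_t}^2 \le 4\sigma^2\log(|\mathcal{F}^\alpha|/\delta) + \alpha\eta_t.$$

Taking the infimum over the size of $\alpha$-covers implies:

$$\left\|\hat{f}_t^{LS} - f_\theta\right\|_{2,E_t} \le \sqrt{8\sigma^2\log\left(N(\mathcal{F}, \alpha, \|\cdot\|_\infty)/\delta\right) + 2\alpha\eta_t} =: \sqrt{\beta_t^*(\mathcal{F}, \delta, \alpha)}.$$

□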
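The confidence parameter that comes out of this proof is straightforward to evaluate. The sketch below computes $\eta_t$ and $\beta_t^*$ from the expressions above; the function names are hypothetical, and the logarithm of the covering number is supplied by the caller as an assumption, since it depends on the function class.

```python
import math

def eta(t, C, sigma, delta):
    """eta_t = t * [8C + sqrt(8 sigma^2 ln(4 t^2 / delta))]."""
    return t * (8.0 * C + math.sqrt(8.0 * sigma ** 2 * math.log(4.0 * t ** 2 / delta)))

def beta_star(t, C, sigma, delta, alpha, log_covering_number):
    """beta*_t = 8 sigma^2 log(N / delta) + 2 alpha eta_t, with log N passed in."""
    log_term = log_covering_number + math.log(1.0 / delta)
    return 8.0 * sigma ** 2 * log_term + 2.0 * alpha * eta(t, C, sigma, delta)

# Finite class with 50 functions: a cover at scale alpha = 0 is the class itself.
print(beta_star(t=10, C=1.0, sigma=1.0, delta=0.1, alpha=0.0,
                log_covering_number=math.log(50)))
# A coarser cover trades a smaller log N for a discretization term growing with t.
print(beta_star(t=10, C=1.0, sigma=1.0, delta=0.1, alpha=0.01,
                log_covering_number=math.log(20)))
```

The radius of the confidence set $\mathcal{F}_t$ is the square root of this quantity.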



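A.4 Discretization Error

Lemma 7. If $f^\alpha$ satisfies $\|f - f^\alpha\|_\infty \le \alpha$, then with probability at least $1 - \delta$,

$$\left|\frac{1}{2}\|f^\alpha - f_\theta\|_{2,E_t}^2 - \frac{1}{2}\|f - f_\theta\|_{2,E_t}^2 + L_{2,t}(f) - L_{2,t}(f^\alpha)\right| \le \alpha t\left[8C + \sqrt{8\sigma^2\ln(4t^2/\delta)}\right] \quad \forall t \in \mathbb{N}. \tag{7}$$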



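Proof. Since any two functions $f, f^\alpha \in \mathcal{F}$ satisfy $\|f - f^\alpha\|_\infty \le C$, it is enough to consider $\alpha \le C$. We
find

$$\left|(f^\alpha)^2(a) - (f)^2(a)\right| \le \max_{-\alpha \le y \le \alpha}\left|(f(a) + y)^2 - f(a)^2\right| = 2f(a)\alpha + \alpha^2 \le 2C\alpha + \alpha^2$$

which implies

$$\begin{aligned}
\left|(f^\alpha(a) - f_\theta(a))^2 - (f(a) - f_\theta(a))^2\right| &= \left|\left[(f^\alpha)(a)^2 - f(a)^2\right] + 2f_\theta(a)\left(f(a) - f^\alpha(a)\right)\right| \le 4C\alpha + \alpha^2 \\
\left|(R_t - f(a))^2 - (R_t - f^\alpha(a))^2\right| &= \left|2R_t\left(f^\alpha(a) - f(a)\right) + f(a)^2 - f^\alpha(a)^2\right| \le 2\alpha|R_t| + 2C\alpha + \alpha^2.
\end{aligned}$$

Summing over $t$, we find that the left hand side of (7) is bounded by

$$\sum_{k=1}^{t-1}\left(\frac{1}{2}\left[4C\alpha + \alpha^2\right] + \left[2\alpha|R_k| + 2C\alpha + \alpha^2\right]\right) \le \alpha\sum_{k=1}^{t-1}\left(6C + 2|R_k|\right).$$

Because $\epsilon_k$ is sub-Gaussian, $\mathbb{P}\left(|\epsilon_k| > \sqrt{2\sigma^2\ln(2/\delta)}\right) \le \delta$. By a union bound, $\mathbb{P}\left(\exists k \text{ s.t. } |\epsilon_k| > \sqrt{2\sigma^2\ln(4k^2/\delta)}\right) \le
\frac{\delta}{2}\sum_{k=1}^{\infty}\frac{1}{k^2} \le \delta$. Since $|R_k| \le C + |\epsilon_k|$ this shows that with probability at least $1 - \delta$ the discretization error
is bounded for all $t$ by $\alpha\eta_t$ where $\eta_t := t\left[8C + 2\sqrt{2\sigma^2\ln(4t^2/\delta)}\right]$.

□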



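B Bounds on Margin Dimension for Common Function Classes

Definition 3, which defines the margin dimension of a class of functions, can be equivalently written as
follows. The $\epsilon$-margin dimension of a class of functions $\mathcal{F}$ is the length of the longest sequence $a_1, \ldots, a_\tau$ such
that for some $\epsilon' \ge \epsilon$

$$w_k := \sup\left\{(f_{\rho_1} - f_{\rho_2})(a_k) : \sqrt{\sum_{i=1}^{k-1}(f_{\rho_1} - f_{\rho_2})^2(a_i)} \le \epsilon',\ \rho_1, \rho_2 \in \Theta\right\} > \epsilon' \tag{8}$$

for each $k \le \tau$.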



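B.1 Finite Action Spaces

Any action is $\epsilon'$-dependent on itself since $\sup\left\{(f_{\rho_1} - f_{\rho_2})(a) : \sqrt{(f_{\rho_1} - f_{\rho_2})^2(a)} \le \epsilon',\ \rho_1, \rho_2 \in \Theta\right\} \le \epsilon'$.
Therefore, for all $\epsilon > 0$, the $\epsilon$-margin dimension of $\mathcal{A}$ is bounded by $|\mathcal{A}|$.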
B.2 Linear Case

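Proposition 10. Suppose $\Theta \subset \mathbb{R}^d$ and $f_\theta(a) = \theta^T\phi(a)$. Assume there exist constants $\gamma$ and $S$ such that
for all $a \in \mathcal{A}$ and $\rho \in \Theta$, $\|\rho\|_2 \le S$ and $\|\phi(a)\|_2 \le \gamma$. Then $\dim_M(\mathcal{F}, \epsilon) \le 3d\frac{e}{e-1}\ln\left\{3 + 3\left(\frac{2S\gamma}{\epsilon}\right)^2\right\} + 1$.

To simplify the notation, define $w_k$ as in (8), $\phi_k = \phi(a_k)$, $\rho = \rho_1 - \rho_2$, and $\Phi_k = \sum_{i=1}^{k-1}\phi_i\phi_i^T$. In this case,
$\sum_{i=1}^{k-1}(f_{\rho_1} - f_{\rho_2})^2(a_i) = \rho^T\Phi_k\rho$, and by the triangle inequality $\|\rho\|_2 \le 2S$. The proof follows by bounding
the number of times $w_k > \epsilon'$ can occur.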



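Step 1: If $w_k \ge \epsilon'$ then $\phi_k^T V_k^{-1}\phi_k \ge \frac{1}{2}$ where $V_k := \Phi_k + \lambda I$ and $\lambda = \left(\frac{\epsilon'}{2S}\right)^2$.

Proof. We find $w_k \le \max\{\rho^T\phi_k : \rho^T\Phi_k\rho \le (\epsilon')^2,\ \rho^T I\rho \le (2S)^2\} \le \max\{\rho^T\phi_k : \rho^T V_k\rho \le 2(\epsilon')^2\} = \sqrt{2(\epsilon')^2}\,\|\phi_k\|_{V_k^{-1}}$.
The second inequality follows because any $\rho$ that is feasible for the first maximization problem must satisfy
$\rho^T V_k\rho \le (\epsilon')^2 + \lambda(2S)^2 = 2(\epsilon')^2$. By this result, $w_k \ge \epsilon'$ implies $\|\phi_k\|_{V_k^{-1}}^2 \ge 1/2$. □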



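Step 2: If $w_i \ge \epsilon'$ for each $i < k$ then $\det V_k \ge \lambda^d\left(\frac{3}{2}\right)^{k-1}$ and $\det V_k \le \left(\frac{\gamma^2(k-1)}{d} + \lambda\right)^d$.

Proof. Since $V_k = V_{k-1} + \phi_{k-1}\phi_{k-1}^T$, using the Matrix Determinant Lemma,

$$\det V_k = \det V_{k-1}\left(1 + \phi_{k-1}^T V_{k-1}^{-1}\phi_{k-1}\right) \ge \det V_{k-1}\left(\frac{3}{2}\right) \ge \ldots \ge \det[\lambda I]\left(\frac{3}{2}\right)^{k-1} = \lambda^d\left(\frac{3}{2}\right)^{k-1}.$$

Recall that $\det V_k$ is the product of the eigenvalues of $V_k$, whereas $\mathrm{trace}[V_k]$ is the sum. As noted in [12],
$\det V_k$ is maximized when all eigenvalues are equal. This implies: $\det V_k \le \left(\frac{\mathrm{trace}[V_k]}{d}\right)^d \le \left(\frac{\gamma^2(k-1)}{d} + \lambda\right)^d$. □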
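Both ingredients of Step 2, the Matrix Determinant Lemma for rank-one updates and the bound of a determinant by the $d$-th power of the average eigenvalue, can be checked numerically. The sketch below uses NumPy with randomly drawn feature vectors; the function name, dimension, and regularization value are illustrative assumptions.

```python
import numpy as np

def check_determinant_steps(d=4, lam=0.5, steps=6, seed=0):
    """Return (det of updated V, lemma prediction, trace-based upper bound) per update."""
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)
    records = []
    for _ in range(steps):
        phi = rng.uniform(-1.0, 1.0, size=d)
        updated = V + np.outer(phi, phi)
        # Matrix Determinant Lemma: det(V + phi phi^T) = det(V) (1 + phi^T V^{-1} phi).
        lemma_rhs = np.linalg.det(V) * (1.0 + phi @ np.linalg.solve(V, phi))
        # Arithmetic-geometric mean inequality applied to the eigenvalues.
        trace_bound = (np.trace(updated) / d) ** d
        records.append((np.linalg.det(updated), lemma_rhs, trace_bound))
        V = updated
    return records

for det_new, lemma_rhs, trace_bound in check_determinant_steps():
    print(round(det_new, 6), round(lemma_rhs, 6), round(trace_bound, 6))
```

In each row the first two numbers should agree and the third should be at least as large.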

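Step 3: Complete Proof

Proof. Manipulating the result of Step 2 shows $k$ must satisfy the inequality: $\left(\frac{3}{2}\right)^{\frac{k-1}{d}} \le \alpha_0\left[\frac{k-1}{d}\right] + 1$ where
$\alpha_0 = \frac{\gamma^2}{\lambda} = \left(\frac{2S\gamma}{\epsilon'}\right)^2$. Let $B(x, \alpha) = \max\{B : (1 + x)^B \le \alpha B + 1\}$. The number of times $w_k > \epsilon'$ can
occur is bounded by $dB(1/2, \alpha_0) + 1$.

We now derive an explicit bound on $B(x, \alpha)$ for any $x \le 1$. Note that any $B \ge 1$ must satisfy the
inequality: $\ln\{1 + x\}B \le \ln\{1 + \alpha\} + \ln B$. Since $\ln\{1 + x\} \ge x/(1 + x)$, using the transformation of
variables $y = B[x/(1 + x)]$ gives:

$$y \le \ln\{1 + \alpha\} + \ln\frac{1 + x}{x} + \ln y \le \ln\{1 + \alpha\} + \ln\frac{1 + x}{x} + \frac{y}{e} \implies y \le \frac{e}{e - 1}\left(\ln\{1 + \alpha\} + \ln\frac{1 + x}{x}\right).$$

This implies $B(x, \alpha) \le \frac{1 + x}{x}\frac{e}{e - 1}\left(\ln\{1 + \alpha\} + \ln\frac{1 + x}{x}\right)$. The claim follows by plugging in $\alpha = \alpha_0$ and $x = 1/2$.

□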
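The explicit bound on $B(x, \alpha)$ can be compared against a numerical solution. The sketch below finds the largest real $B$ with $(1 + x)^B \le \alpha B + 1$ by bisection and evaluates the closed-form bound from Step 3; the function names and test values are illustrative assumptions.

```python
import math

def largest_feasible_B(x, alpha, tol=1e-10):
    """Largest real B >= 0 satisfying (1 + x)**B <= alpha * B + 1."""
    rate = math.log1p(x)
    if alpha <= rate:
        return 0.0                       # the exponential dominates for every B > 0

    def gap(B):
        return math.exp(rate * B) - alpha * B - 1.0

    lo = math.log(alpha / rate) / rate   # minimizer of the convex gap, where gap < 0
    hi = max(1.0, 2.0 * lo)
    while gap(hi) <= 0.0:                # expand until the constraint is violated
        hi *= 2.0
    while hi - lo > tol:                 # bisection on the increasing branch
        mid = 0.5 * (lo + hi)
        if gap(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

def explicit_bound(x, alpha):
    """((1+x)/x) * (e/(e-1)) * (ln(1+alpha) + ln((1+x)/x))."""
    ratio = (1.0 + x) / x
    return ratio * math.e / (math.e - 1.0) * (math.log(1.0 + alpha) + math.log(ratio))

for alpha in (1.0, 10.0, 1000.0):
    print(alpha, largest_feasible_B(0.5, alpha), explicit_bound(0.5, alpha))
```

For each value of $\alpha$ the numerically computed threshold should sit below the explicit bound, which is loose by a moderate constant factor.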

B.3 Generalized Linear Models 

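Proposition 11. Suppose $\Theta \subset \mathbb{R}^d$ and $f_\theta(a) = g(\theta^T\phi(a))$ where $g(\cdot)$ is a differentiable and strictly increasing
function. Assume there exist constants $\underline{h}$, $\overline{h}$, $\gamma$, and $S$, such that for all $a \in \mathcal{A}$ and $\rho \in \Theta$, $0 < \underline{h} \le
g'(\rho^T\phi(a)) \le \overline{h}$, $\|\rho\|_2 \le S$, and $\|\phi(a)\|_2 \le \gamma$. Then, with $r = \overline{h}/\underline{h}$, $\dim_M(\mathcal{F}, \epsilon) \le 3dr^2\frac{e}{e-1}\ln\left\{3r^2 + 3r^2\left(\frac{2S\overline{h}\gamma}{\epsilon}\right)^2\right\} + 1$.

The proof follows three steps which closely mirror those used to prove Proposition 10.

Step 1: If $w_k \ge \epsilon'$ then $\phi_k^T V_k^{-1}\phi_k \ge \frac{1}{2r^2}$ where $V_k := \Phi_k + \lambda I$ and $\lambda = \left(\frac{\epsilon'}{2S\underline{h}}\right)^2$.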

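Proof. By definition $w_k \le \max\left\{g(\rho^T\phi_k) : \sum_{i=1}^{k-1} g(\rho^T\phi(a_i))^2 \le (\epsilon')^2,\ \rho^T I\rho \le (2S)^2\right\}$. By the uniform
bound on $g'(\cdot)$ this is less than $\max\left\{\overline{h}\rho^T\phi_k : \underline{h}^2\rho^T\Phi_k\rho \le (\epsilon')^2,\ \rho^T I\rho \le (2S)^2\right\} \le \max\left\{\overline{h}\rho^T\phi_k : \underline{h}^2\rho^T V_k\rho \le 2(\epsilon')^2\right\}
= \sqrt{2(\epsilon')^2/\underline{h}^2}\;\overline{h}\,\|\phi_k\|_{V_k^{-1}}$. □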
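Step 2: If $w_i \ge \epsilon'$ for each $i < k$ then $\det V_k \ge \lambda^d\left(1 + \frac{1}{2r^2}\right)^{k-1}$ and $\det V_k \le \left(\frac{\gamma^2(k-1)}{d} + \lambda\right)^d$.

Step 3: Complete Proof

Proof. The above inequalities imply $k$ must satisfy: $\left(1 + \frac{1}{2r^2}\right)^{\frac{k-1}{d}} \le \alpha_0\left[\frac{k-1}{d}\right] + 1$ where $\alpha_0 = \gamma^2/\lambda$. Therefore,
as in the linear case, the number of times $w_k > \epsilon'$ can occur is bounded by $dB\left(\frac{1}{2r^2}, \alpha_0\right) + 1$. Plugging these
constants into the earlier bound $B(x, \alpha) \le \frac{1 + x}{x}\frac{e}{e - 1}\left(\ln\{1 + \alpha\} + \ln\frac{1 + x}{x}\right)$ and using $1 + x \le 3/2$ yields the
result. □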